Unicode Regular Expressions, Surrogate Points and UTF-8

Sat May 31 07:11:03 CDT 2014

On Sat, 31 May 2014 13:21:23 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> However CESU-8 can be detected by the initial encoding of another byte
> order mark U+1FFFE (which is a non-character that MUST be stripped
> once detected from the parsed stream of code points). However,
> documents starting with this non-character are supposed to be
> non-interoperable by definition, even though the presence of that
> special byte order mark would be very safe to secure CESU-8 and
> discriminate it from UTF-8.

Where is this tagging defined?

It is in general not true that non-characters must be stripped on
input.  That would be highly inappropriate in a conversion program that
transformed between UTFs.  Also, the collations defined in CLDR Version
23 file collation/zh.xml would be severely damaged if the
non-characters were stripped out.  In version 24 and later the file
uses a different syntax and doesn't contain non-characters. 
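
To make the two points concrete, here is a minimal Python sketch, offered as an illustration only. It shows that U+1FFFE has different byte sequences in UTF-8 and CESU-8, and that a conversion between UTFs carries the non-character through unchanged. The helper name cesu8_encode is hypothetical (Python has no built-in CESU-8 codec), and nothing here implies that tagging a stream with U+1FFFE is defined by any standard.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    Supplementary code points are split into a UTF-16 surrogate pair,
    and each surrogate is then written as a 3-byte UTF-8-style sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                # 'surrogatepass' lets the UTF-8 codec emit lone surrogates.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


NONCHAR = "\U0001FFFE"

# The same code point yields different bytes in the two encodings.
utf8_bytes = NONCHAR.encode("utf-8")    # F0 9F BF BE
cesu8_bytes = cesu8_encode(NONCHAR)     # ED A0 BF ED BF BE

# A converter between UTFs should pass the non-character through intact.
roundtrip = (
    NONCHAR.encode("utf-8").decode("utf-8")
    .encode("utf-16-le").decode("utf-16-le")
    .encode("utf-32-be").decode("utf-32-be")
)

print(utf8_bytes.hex(" "), "|", cesu8_bytes.hex(" "), "|", roundtrip == NONCHAR)
```

The round trip at the end is the behaviour one would expect of a conversion program: the non-character survives every transformation rather than being stripped.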

Richard.