Unicode Regular Expressions, Surrogate Points and UTF-8

Philippe Verdy verdy_p at wanadoo.fr
Sat May 31 08:08:59 CDT 2014


I was only speaking of stripping a single leading U+1FFFE in the context of
autodetecting UTF-8 vs. CESU-8, where a leading U+FEFF BOM is not reliable
enough (if the non-BMP characters are not found in the first few KB of
data), or where the presence of a non-BMP character in that leading few KB
could cause UTF-8 to be rejected, but CESU-8 still not selected as a
candidate.
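
To illustrate why a limited look-ahead window can leave the choice open,
here is a minimal sketch of such a heuristic. The function name, the return
labels and the 4096-byte default window are assumptions made for this
example, and the scan does not validate the input; it only looks for the two
byte patterns that tell the encodings apart.

```python
def guess_utf8_or_cesu8(data: bytes, window: int = 4096) -> str:
    """Guess 'utf-8', 'cesu-8' or 'ambiguous' from the first `window` bytes.

    Hypothetical helper: it only looks for 4-byte UTF-8 lead bytes and for
    3-byte encoded surrogates, without validating the whole stream.
    """
    head = data[:window]
    saw_four_byte = False   # F0..F4 lead byte: supplementary char in UTF-8
    saw_surrogate = False   # ED A0..BF xx: surrogate encoded on 3 bytes
    i = 0
    while i < len(head):
        b = head[i]
        if b < 0x80:
            i += 1
        elif 0xF0 <= b <= 0xF4:
            saw_four_byte = True
            i += 4
        elif b == 0xED and i + 1 < len(head) and 0xA0 <= head[i + 1] <= 0xBF:
            saw_surrogate = True
            i += 3
        elif 0xE0 <= b <= 0xEF:
            i += 3
        elif 0xC0 <= b <= 0xDF:
            i += 2
        else:
            i += 1  # stray continuation or invalid byte: skip it
    if saw_four_byte and not saw_surrogate:
        return "utf-8"
    if saw_surrogate and not saw_four_byte:
        return "cesu-8"
    # Only BMP characters seen (or conflicting evidence): cannot decide.
    return "ambiguous"
```

A stream containing only BMP characters in the window yields "ambiguous",
which is exactly the case where an explicit leading marker would help.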

However, CESU-8 is rarely used except for compatibility with some old
processes built initially for handling only characters in the BMP and
treating surrogates as if they were characters, when those processes cannot
accept 4-byte UTF-8 encoded sequences.
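
As a rough sketch of what "treating surrogates as if they were characters"
means at the byte level, the hypothetical encoder below (encode_cesu8 is a
name chosen for this example, not a library function) splits each non-BMP
code point into a UTF-16 surrogate pair and encodes each surrogate on 3
bytes. It relies on Python's "surrogatepass" error handler to emit the
surrogates.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode `text` as CESU-8 (illustrative sketch, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8 (1 to 3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code points become a surrogate pair, and each
            # surrogate is encoded separately as a 3-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)
```

A supplementary character therefore takes 6 bytes in CESU-8 instead of the 4
bytes it takes in UTF-8, and no byte in the F0..F4 range ever appears.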

CESU-8 is a legacy encoding; UTF-8 is far better and is now well supported
in most OSes, "Unicode" libraries and most old protocols (plus all new ones).

Inserting the special BOM for CESU-8 on output is possible, while also
forcing it to be stripped on input.
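
A minimal sketch of that convention follows. Note that using U+1FFFE as a
CESU-8 signature is only the idea discussed in this thread, not something
defined by a standard, and the helper names are invented for the example.
The two byte strings are U+1FFFE encoded as the surrogate pair D83F DFFE
(CESU-8) and as a single 4-byte sequence (UTF-8).

```python
from typing import Tuple

# U+1FFFE as the surrogate pair D83F DFFE, each surrogate on 3 bytes.
CESU8_MARK = b"\xed\xa0\xbf\xed\xbf\xbe"
# The same code point as a regular 4-byte UTF-8 sequence, for comparison.
UTF8_FORM = b"\xf0\x9f\xbf\xbe"


def add_cesu8_mark(payload: bytes) -> bytes:
    """Prefix CESU-8 output with the marker unless it is already present."""
    if payload.startswith(CESU8_MARK):
        return payload
    return CESU8_MARK + payload


def strip_cesu8_mark(data: bytes) -> Tuple[bool, bytes]:
    """Return (marker_found, remaining_bytes); strip one leading marker only."""
    if data.startswith(CESU8_MARK):
        return True, data[len(CESU8_MARK):]
    return False, data
```

Because the two encoded forms differ from the very first byte, a reader that
finds the 6-byte form at the start of the stream can select CESU-8 without
scanning any further.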

And this has nothing in common with collation. Collations are *not*
encodings, even if internally they may re-encode text during preprocessing
within a private interface between layers of the collation process. No
interchange agreement is needed: these internal steps are mutually bound
together and not easily separable (except the leading normalization step,
but the most efficient collators do not separate these steps; they pipeline
them in a finite-state automaton to reduce the use of multiple buffers and
maximize data locality within a small set of state variables and code
branches).

2014-05-31 14:11 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sat, 31 May 2014 13:21:23 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > However CESU-8 can be detected by the initial encoding of another byte
> > order mark U+1FFFE (which is a non-character that MUST be stripped
> > once detected from the parsed stream of code points). However,
> > documents starting with this non-character are supposed to be
> > non-interoperable by definition even though the presence of that
> > special byte order mark would be very safe to secure CESU-8 and
> > discriminate it from UTF-8.
>
> Where is this tagging defined?
>
> It is in general not true that non-characters must be stripped on
> input.  That would be highly inappropriate in a conversion program that
> transformed between UTFs.  Also, the collations defined in CLDR Version
> 23 file collation/zh.xml would be severely damaged if the
> non-characters were stripped out.  In version 24 and later the file
> uses a different syntax and doesn't contain non-characters.
>
> Richard.
>