Unicode Regular Expressions, Surrogate Points and UTF-8

Philippe Verdy verdy_p at wanadoo.fr
Sat May 31 06:21:23 CDT 2014


I think Richard did not speak about that, but about the behavior of a
matcher that would start parsing a text using a wrongly guessed encoding.
He gave the example of a valid CESU-8 text containing U+10000: when
reading it incorrectly as UTF-8, the parser gets 4 invalid sequences.
CESU-8 cannot be easily detected at the start of the stream from the
encoding of the byte order mark U+FEFF, because U+FEFF is in the BMP and
is encoded as the same three bytes (EF BB BF) in both CESU-8 and UTF-8.
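
To make the byte-level situation concrete, here is a minimal sketch in
Python. The helper names cesu8_encode and is_strict_utf8 are hypothetical
(Python has no built-in CESU-8 codec); the sketch only relies on the
standard "utf-8" codec and its "surrogatepass" error handler. It shows
that U+10000 in CESU-8 is rejected by a strict UTF-8 decoder, while
U+FEFF yields identical bytes in both forms.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a stdlib codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary code point: split it into a UTF-16 surrogate
            # pair and encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


cesu = cesu8_encode("\U00010000")
print(cesu.hex())                  # eda080edb080
print(is_strict_utf8(cesu))        # False: encoded surrogates are rejected
print(cesu8_encode("\ufeff") == "\ufeff".encode("utf-8"))  # True
```

How many separate invalid sequences a decoder reports for those six bytes
depends on its error-recovery strategy, so the sketch only checks that
strict decoding fails.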

However, CESU-8 can be detected by the initial encoding of another byte
order mark, U+1FFFE (a noncharacter that MUST be stripped from the parsed
stream of code points once detected). Documents starting with this
noncharacter are, however, supposed to be non-interoperable by definition,
even though the presence of that special byte order mark would be a very
safe way to identify CESU-8 and discriminate it from UTF-8.
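
The detection idea above can be sketched as follows. This is only an
illustration of the scheme described in this message, not a standardized
signature: the function name sniff_u1fffe and the two constants are
hypothetical. Because U+1FFFE is a supplementary code point, CESU-8
encodes it as two 3-byte surrogate sequences (ED A0 BF ED BF BE), whereas
UTF-8 uses a single 4-byte sequence (F0 9F BF BE), so the leading bytes
differ.

```python
# Leading bytes of U+1FFFE in each encoding form.
CESU8_SIGNATURE = bytes.fromhex("eda0bfedbfbe")  # surrogates D83F, DFFE
UTF8_SIGNATURE = bytes.fromhex("f09fbfbe")       # one 4-byte sequence


def sniff_u1fffe(data: bytes):
    """Guess the encoding from a leading U+1FFFE and strip that mark.

    Returns (encoding_name, remaining_bytes); encoding_name is None
    when the stream does not start with either signature.
    """
    if data.startswith(CESU8_SIGNATURE):
        return "cesu-8", data[len(CESU8_SIGNATURE):]
    if data.startswith(UTF8_SIGNATURE):
        return "utf-8", data[len(UTF8_SIGNATURE):]
    return None, data


print(sniff_u1fffe(CESU8_SIGNATURE + b"abc"))  # ('cesu-8', b'abc')
print(sniff_u1fffe(UTF8_SIGNATURE + b"abc"))   # ('utf-8', b'abc')
print(sniff_u1fffe(b"abc"))                    # (None, b'abc')
```

The mark is removed from the returned bytes, matching the requirement
that the noncharacter be stripped from the parsed stream once detected.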



2014-05-31 1:15 GMT+02:00 Markus Scherer <markus.icu at gmail.com>:

> If you use Unicode 16-bit strings, it's easy to "pass through" unpaired
> surrogates and treat them like code points; it's often not productive or
> necessary to check for them all the time, that is, to be strict about
> UTF-16.
>
> On the other hand, I don't think anyone expects you to support invalid
> UTF-8, and especially not to support any and all Unicode 8-bit strings (see
> Unicode 3.9 Unicode Encoding Forms for what I mean here).
>
> If you find UTS #18 unclear or misleading, I suggest you submit feedback
> pointing out specific text issues.
>
> markus