Unicode Regular Expressions, Surrogate Points and UTF-8
markus.icu at gmail.com
Sun Jun 1 10:58:26 CDT 2014
On Sun, Jun 1, 2014 at 1:49 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:
> D80: Unicode string:
> A code unit sequence containing code units of a particular Unicode
> encoding form...
Right -- a Unicode 16-bit string is a sequence of arbitrary 16-bit values
in any order. Well-formedness applies to the UTF-x encoding forms.
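To make the distinction concrete, here is a minimal Java sketch (Java strings are 16-bit strings). The class and method names are made up for illustration; only the java.lang.Character and java.lang.String calls are real API. It checks whether a 16-bit string is also well-formed UTF-16, that is, whether every surrogate is part of a proper lead/trail pair.

```java
// Hypothetical helper for illustration; not part of ICU or the JDK.
class Utf16Check {
    // Returns true if every lead surrogate is immediately followed by a
    // trail surrogate and no trail surrogate stands alone.
    static boolean isWellFormedUtf16(CharSequence s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // lead surrogate without a trail
                }
                i += 2; // skip the complete pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // trail surrogate without a lead
            } else {
                i++;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String paired = new String(new char[] {'a', (char) 0xD808, (char) 0xDF45, 'b'});
        String lone = new String(new char[] {'a', (char) 0xD800, 'b'});

        System.out.println(isWellFormedUtf16(paired)); // true
        System.out.println(isWellFormedUtf16(lone));   // false

        // The ill-formed string is still a valid 16-bit string; the lone
        // surrogate is reported as a code point of its own.
        System.out.println(Integer.toHexString(lone.codePointAt(1))); // d800
    }
}
```

Both values are legal Java strings; only the first is also well-formed UTF-16.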
It is common not to treat unpaired surrogates as errors, because they behave
like "boring" code points; that is, they are "harmless". However, that does
not mean that they work like fully supported code points everywhere, just
that they are often treated like harmless code points where that is easier.
In ICU4C simple string functions, if you search for code point 0xd800 you
will find it in a string if it occurs as an unpaired surrogate.
In ICU collation of 16-bit strings, an unpaired surrogate sorts with an
unassigned-implicit primary weight. (You can try this with the online
collation demo.) In ICU UTF-8 collation, ill-formed sequences sort like
> >> \uD808\uDF45 specifies a sequence of two
> >> codepoints.
> > Implementations that use Unicode 16-bit strings will usually treat
> > this as one supplementary code point.
> > In Java, there is no other way to escape one.
> In which case, Java does *not* supply 'basic Unicode support' as defined
> by UTS#18 Version 17 - see just before Section 1.1.1 therein. An
> engine that matches code unit by code unit does not comply with RL1.7.
You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
supplementary code point, but as long as you have a surrogate pair, it is
treated as a code point in APIs that support them.
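A short Java sketch of what that looks like in practice. The class name is invented for the example; the String and Character methods are standard JDK API. The string literal is written as two escapes, but the code point APIs see a single supplementary code point, U+12345.

```java
// Hypothetical demo class; only the JDK calls are real API.
class SurrogatePairDemo {
    // One supplementary code point (U+12345) written as a surrogate pair.
    static final String S = "\uD808\uDF45";

    public static void main(String[] args) {
        System.out.println(S.length());                            // 2 (code units)
        System.out.println(S.codePointCount(0, S.length()));       // 1 (code point)
        System.out.println(Integer.toHexString(S.codePointAt(0))); // 12345

        // Building the same string from the code point value gives an equal string.
        String fromCodePoint = new String(Character.toChars(0x12345));
        System.out.println(S.equals(fromCodePoint));               // true
    }
}
```

The char-based methods such as length() and charAt() still count code units, so which view you get depends on which API you call.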
Java 5 upgraded the regular expression code to match code points, not code
units. I don't know what it does when the pattern contains an unpaired
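For the well-formed case, code point matching in java.util.regex can be observed with a small sketch like the following. The class and helper names are invented for illustration. It only exercises a properly paired surrogate and says nothing about how unpaired surrogates in a pattern are handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; Pattern and Matcher are the real JDK API.
class RegexCodePointDemo {
    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', U+12345 (as a surrogate pair), 'b': 4 code units, 3 code points.
        String s = "a\uD808\uDF45b";

        // "." consumes a whole code point, so the pair counts once.
        System.out.println(countMatches(".", s));                 // 3

        // A single "." matches the entire two-unit string.
        System.out.println(Pattern.matches(".", "\uD808\uDF45")); // true
    }
}
```

An engine that matched code unit by code unit would report four matches for the first call and no full match for the second.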