Unicode Regular Expressions, Surrogate Points and UTF-8

Sun Jun 1 03:49:31 CDT 2014

On Sat, 31 May 2014 19:28:27 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > Bear in mind that a pattern \uD808 shall not match anything in a
> > well-formed Unicode string.
> 
> 
> Depends. See the definitions of Unicode strings vs. UTF strings.

D80 Unicode string:
A code unit sequence containing code units of a particular Unicode
encoding form...

D85 Well-formed:
A Unicode code unit sequence that purports to be in a Unicode encoding
form is called well-formed if and only if it does follow the
specification of that Unicode encoding form.

How does a Unicode string purport anything?

>> \uD808\uDF45 specifies a sequence of two codepoints.

> Implementations that use Unicode 16-bit strings will usually treat
> this as one supplementary code point.
> In Java, there is no other way to escape one.

In which case, Java does *not* supply 'basic Unicode support' as defined
by UTS#18 Version 17 - see just before Section 1.1.1 therein.  An
engine that matches code unit by code unit does not comply with RL1.7.
This makes sense in so far as it provides for consistent results across
UTF encodings for Unicode strings that could once have been converted
reversibly.  (A 32-bit Unicode string <D808, DF45> converted to
a 16-bit Unicode string and back would become <12345>.)  Now that such a
conversion should not preserve lone surrogates (per C10 together with
D93, and separately per TUS Section 5.22), it makes less sense.
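For anyone who wants to try the parenthetical example, here is a minimal
Java sketch.  It assumes Java 8 or later; the class and method names are
mine and purely illustrative.  It builds a 16-bit string from the two code
points D808 and DF45, reads it back as code points, and also checks how
java.util.regex treats the escape \uD808\uDF45 against U+12345.

```java
import java.util.regex.Pattern;

// Illustrative names only; nothing here comes from UTS#18 or ICU.
class SurrogateRoundTrip {
    // Store a code point sequence in a 16-bit (UTF-16 code unit) string,
    // then read that string back as a code point sequence.
    static int[] roundTrip(int[] codePoints) {
        String utf16 = new String(codePoints, 0, codePoints.length);
        return utf16.codePoints().toArray();
    }

    // Test the regex escape \uD808\uDF45 against a string holding the
    // single supplementary code point U+12345.
    static boolean escapedPairMatchesSupplementary() {
        String text = new String(Character.toChars(0x12345));
        return Pattern.compile("\\uD808\\uDF45").matcher(text).matches();
    }

    public static void main(String[] args) {
        int[] back = roundTrip(new int[] {0xD808, 0xDF45});
        System.out.printf("%d code point(s), first = U+%X%n", back.length, back[0]);
        System.out.println(escapedPairMatchesSupplementary());
    }
}
```

The two surrogate code points go in separately and come back as one
supplementary code point, so the conversion is not reversible.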

However, I can think of one major objection to a regular
expression engine using 16-bit Unicode strings treating every
supplementary point as a sequence of two surrogate points.  While it
might be acceptable for a lone surrogate to match \P{L} (codepoints
that are not letters), it would not be acceptable for every
supplementary point to match \P{L}\P{L} or even \p{Any}\p{Any}.
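As a rough illustration of the behaviour one would want, here is a small
Java sketch (again with illustrative names of my own, and using '.' in
place of \p{Any}).  It takes U+12345, a cuneiform letter held as two
16-bit code units, and tries a few patterns against the whole string.
It shows only what java.util.regex does, not what any other engine does.

```java
import java.util.regex.Pattern;

// Illustrative names only.
class SupplementaryProperty {
    // U+12345 is a letter (general category Lo) stored as a surrogate pair.
    static final String CUNEIFORM = new String(Character.toChars(0x12345));

    // True if the regex matches the entire text.
    static boolean wholeMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println("code units: " + CUNEIFORM.length());
        System.out.println("\\p{L}      : " + wholeMatch("\\p{L}", CUNEIFORM));
        System.out.println("\\P{L}\\P{L} : " + wholeMatch("\\P{L}\\P{L}", CUNEIFORM));
        System.out.println(".          : " + wholeMatch(".", CUNEIFORM));
        System.out.println("..         : " + wholeMatch("..", CUNEIFORM));
    }
}
```

An engine that matched surrogate by surrogate would instead accept the
two-element patterns for this string, which is the objection above.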

Richard.