Unicode Regular Expressions, Surrogate Points and UTF-8

Sat May 31 03:59:58 CDT 2014

On Fri, 30 May 2014 16:15:12 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> If you find UTS #18 unclear or misleading, I suggest you submit
> feedback pointing out specific text issues.

In this case UTS #18 seems to be making a pointless, counter-productive
'demand'.  I'm raising the point here first in case there is a good
reason for the requirement.

> If you use Unicode 16-bit strings, it's easy to "pass through"
> unpaired surrogates and treat them like code points; it's often not
> productive or necessary to check for them all the time, that is, to
> be strict about UTF-16.

Bear in mind that a pattern \uD808 shall not match anything in a
well-formed Unicode string.  \uD808\uDF45 specifies a sequence of two
code points.  This sequence can occur in an ill-formed UTF-32 Unicode
string and, before Unicode 5.2, could readily be taken to occur in an
ill-formed UTF-8 Unicode string.  RL1.7 declares that, for a regular
expression engine, the code point sequence <U+D808, U+DF45> cannot
occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
DF45> is the code point sequence <U+12345 CUNEIFORM SIGN URU TIMES
KI>.
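
For concreteness, here is a minimal Python sketch of the arithmetic by
which the code units <D808 DF45> combine into U+12345.  The helper name
combine_surrogates is made up for illustration; the cross-check uses
Python's standard "utf-16-be" codec.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high (lead) surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low (trail) surrogate: %04X" % low)
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD808, 0xDF45)

# Cross-check: decoding the same two code units as big-endian UTF-16
# yields a one-character string holding that code point.
decoded = b"\xd8\x08\xdf\x45".decode("utf-16-be")
```

Here code_point is 0x12345 and decoded is a string of length 1, so in
UTF-16 the two code units can only be read as the one supplementary
code point, never as two separate surrogate code points.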

> On the other hand, I don't think anyone expects you to support invalid
> UTF-8, and especially not to support any and all Unicode 8-bit
> strings (see Unicode 3.9 Unicode Encoding Forms for what I mean here).

Is there a use case for having a 1-character RE \uD800 within a
pattern?  I could understand it if it were required to match a lone
surrogate U+D800, but it isn't.  If a regular expression engine
matches lone surrogates, then using the same notation for all
code points is reasonable, but if it doesn't, it would be more useful
for it to treat lone surrogate code points in patterns as errors.
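
As a rough sketch of what "treat as errors" could look like, the Python
below scans pattern source text for \uXXXX escapes and reports any
surrogate that is not part of an adjacent high+low pair.  The function
name lone_surrogate_escapes is hypothetical, and the scan is
deliberately simplistic: it assumes only the \uXXXX escape form and
ignores escaped backslashes and other pattern syntax.

```python
import re
from typing import List

# Matches a \uXXXX escape in pattern *source text* (an assumed syntax).
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def lone_surrogate_escapes(pattern_source: str) -> List[int]:
    """Return surrogate values escaped in the pattern that are unpaired.

    Hypothetical helper; a real engine would do this inside its parser.
    """
    escapes = list(_U_ESCAPE.finditer(pattern_source))
    lone = []
    i = 0
    while i < len(escapes):
        value = int(escapes[i].group(1), 16)
        if 0xD800 <= value <= 0xDBFF:
            # A high surrogate is acceptable only if a low-surrogate
            # escape follows it immediately in the source.
            nxt = escapes[i + 1] if i + 1 < len(escapes) else None
            if (nxt is not None
                    and nxt.start() == escapes[i].end()
                    and 0xDC00 <= int(nxt.group(1), 16) <= 0xDFFF):
                i += 2  # well-formed pair: skip both escapes
                continue
            lone.append(value)
        elif 0xDC00 <= value <= 0xDFFF:
            # A low surrogate not consumed by a preceding high one.
            lone.append(value)
        i += 1
    return lone


paired = lone_surrogate_escapes(r"\uD808\uDF45")
single = lone_surrogate_escapes(r"a\uD800b")
reversed_pair = lone_surrogate_escapes(r"\uDF45\uD808")
```

An engine that cannot match lone surrogates could reject the pattern
whenever this list is non-empty, instead of silently compiling a
pattern that can never match.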

Richard.