Unicode Regular Expressions, Surrogate Points and UTF-8

Mon Jun 2 15:01:53 CDT 2014

On Mon, 2 Jun 2014 11:29:09 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:

> > \uD808\uDF45 specifies a sequence of two codepoints.
> 
> That is simply incorrect.

The above is in the sample notation of UTS #18 Version 17 Section 1.1.

From what I can make out, the corresponding Java notation would be
\x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match in
Java, or whether they are even acceptable.  The only thing UTS #18
RL1.7 permits them to match in Java is lone surrogates, but I don't
know if Java complies.
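One way to settle it would be to ask Java directly. What follows is only a minimal sketch, not a statement of what any particular Java version does: the class name SurrogateProbe and the helper probe are made up for illustration, and it assumes Java 7 or later, where Pattern documents the \x{h...h} form. It deliberately prints the results rather than asserting them, apart from the one case I am sure of, namely that \x{12345} matches U+12345.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SurrogateProbe {
    // U+12345 as a Java string: the two UTF-16 code units D808 DF45.
    static final String TARGET = new String(Character.toChars(0x12345));

    // Hypothetical helper: reports how a regex behaves against TARGET,
    // or that Pattern rejected the expression outright.
    static String probe(String regex) {
        try {
            Pattern p = Pattern.compile(regex);
            return "matches=" + p.matcher(TARGET).matches()
                 + " find=" + p.matcher(TARGET).find();
        } catch (PatternSyntaxException e) {
            return "rejected: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        String[] regexes = {
            "\\x{12345}",          // the supplementary code point itself
            "\\x{D808}\\x{DF45}",  // the two surrogate code points in sequence
            "\\x{D808}",           // the high surrogate alone
            "\\x{DF45}",           // the low surrogate alone
        };
        for (String regex : regexes) {
            System.out.println(regex + " -> " + probe(regex));
        }
    }
}
```

If the second expression reports a match, Java is treating the two surrogate code points as a spelling of the one supplementary character. If either of the last two reports find=true, Java is matching part of a character's code unit sequence.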

All UTS #18 says for sure about regular expressions matching code units
is that they don't satisfy RL1.1, though Section 1.7 appears to ban
them when it says, "A fundamental requirement is that Unicode text be
interpreted semantically by code point, not code units".  Perhaps it's
a fundamental requirement of something other than UTS #18.  I thought
matching parts of characters in terms of their canonical equivalences
was awkward enough, without having the additional option of matching
some of the code units!
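For concreteness, here is a small sketch of the difference between counting by code unit and counting by code point, again in Java. The class name CodePointVsCodeUnit and the helper whole are hypothetical. It relies only on the documented behaviour that java.util.regex matches "." against one code point at a time, so it says nothing about how Java treats explicit surrogate escapes.

```java
import java.util.regex.Pattern;

public class CodePointVsCodeUnit {
    // U+12345, stored by Java as the surrogate pair D808 DF45.
    static final String TARGET = "\uD808\uDF45";

    // Hypothetical helper: true if the regex matches the whole of TARGET.
    static boolean whole(String regex) {
        return Pattern.matches(regex, TARGET);
    }

    public static void main(String[] args) {
        // Two code units, but only one code point.
        System.out.println("code units:  " + TARGET.length());
        System.out.println("code points: " + TARGET.codePointCount(0, TARGET.length()));
        // A single dot consumes the whole pair; two dots find nothing left.
        System.out.println(".  matches:  " + whole("."));
        System.out.println(".. matches:  " + whole(".."));
    }
}
```

An engine that interpreted text by code unit would give the opposite answers for the two dot expressions.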

Richard.