Unicode Regular Expressions, Surrogate Points and UTF-8

Tue Jun 3 17:06:30 CDT 2014

On 06/02/2014 01:01 PM, Richard Wordingham wrote:
> On Mon, 2 Jun 2014 11:29:09 +0200
> Mark Davis ☕️<mark at macchiato.com>  wrote:
>
>>> \uD808\uDF45 specifies a sequence of two codepoints.
>> That is simply incorrect.
> The above is in the sample notation of UTS #18 Version 17 Section 1.1.
>
>  From what I can make out, the corresponding Java notation would be
> \x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match in
> Java, or whether they are even acceptable.  The only thing UTS #18
> RL1.7 permits them to match in Java is lone surrogates, but I don't
> know if Java complies.

The notation "\uD808\uDF45" is interpreted as a single supplementary codepoint and
is represented internally as a pair of surrogates in String.

   Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find();  -> false
   Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find();        -> true
   Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find();           -> false
   Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find();          -> true
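
For anyone who wants to try the four cases above, here is a minimal
self-contained sketch. The class name, the constants and the finds() helper
are made up for illustration; only java.util.regex.Pattern and
Character.toChars are real API. The comments repeat the results reported
above and are not a guarantee for every Java version.

```java
import java.util.regex.Pattern;

public class SurrogateMatchDemo {
    // U+12345 as stored in a Java String: high surrogate D808, low surrogate DF45.
    static final String PAIR = new String(Character.toChars(0x12345));

    // The same two code units separated by '_', so each surrogate stands alone.
    static final String SPLIT = PAIR.charAt(0) + "_" + PAIR.charAt(1);

    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Two separate surrogate escapes against a well-formed pair.
        System.out.println(finds("\\x{D808}\\x{DF45}", PAIR)); // reported above: false
        // The literal pair in the pattern is read as one supplementary codepoint.
        System.out.println(finds(PAIR, PAIR));                  // reported above: true
        // A lone high surrogate does not match half of a well-formed pair.
        System.out.println(finds("\\x{D808}", PAIR));           // reported above: false
        // The same escape does match an isolated high surrogate.
        System.out.println(finds("\\x{D808}", SPLIT));          // reported above: true
    }
}
```

The input is built with Character.toChars rather than written as escapes in
the source, so it is clear that the String holds two code units for a single
codepoint.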

-Sherman

> All UTS #18 says for sure about regular expressions matching code units
> is that they don't satisfy RL1.1, though Section 1.7 appears to ban
> them when it says, "A fundamental requirement is that Unicode text be
> interpreted semantically by code point, not code units".  Perhaps it's
> a fundamental requirement of something other than UTS #18.  I thought
> matching parts of characters in terms of their canonical equivalences
> was awkward enough, without having the additional option of matching
> some of the code units!
>