Unicode Regular Expressions, Surrogate Points and UTF-8

Sun Jun 1 12:04:57 CDT 2014

On Sun, 1 Jun 2014 08:58:26 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> supplementary code point, but as long as you have a surrogate pair,
> it is treated as a code point in APIs that support them.

Wasn't it obvious that, in the following paragraph, \uD808\uDF45 was a
pattern?

"Bear in mind that a pattern \uD808 shall not match anything in a
well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
string and before Unicode 5.2 could readily be taken to occur in an
ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
expression engine, the codepoint sequence <U+D808, U+DF45> cannot
occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
KI>."
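
As a rough illustration of the Java behaviour under discussion, here is a
minimal sketch, assuming java.util.regex (Java 7 or later for the \x{...}
form).  The class and method names are made up for this example.  It
checks that a pattern written as the surrogate-pair escape \uD808\uDF45
matches the single code point U+12345, and it leaves the outcome of the
lone-surrogate pattern \uD808 to be observed rather than asserted, since
that depends on the engine honouring RL1.7.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from the thread.
final class SurrogatePairDemo {
    // U+12345 CUNEIFORM SIGN URU TIMES KI, stored as the code units D808 DF45.
    static final String URU_TIMES_KI = new String(Character.toChars(0x12345));

    // The doubled backslashes hand the escapes to the regex parser itself,
    // which combines a high/low surrogate escape pair into one code point.
    static boolean pairEscapeMatches() {
        return Pattern.compile("\\uD808\\uDF45").matcher(URU_TIMES_KI).matches();
    }

    // The same code point written with the hexadecimal code point escape.
    static boolean hexEscapeMatches() {
        return Pattern.compile("\\x{12345}").matcher(URU_TIMES_KI).matches();
    }

    // A lone high surrogate as the pattern.  Under RL1.7 this should find
    // nothing in a well-formed string; the result is printed, not assumed.
    static boolean loneSurrogateFinds() {
        return Pattern.compile("\\uD808").matcher(URU_TIMES_KI).find();
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + URU_TIMES_KI.length());
        System.out.println("code points: "
                + URU_TIMES_KI.codePointCount(0, URU_TIMES_KI.length()));
        System.out.println("pair escape matches: " + pairEscapeMatches());
        System.out.println("hex escape matches:  " + hexEscapeMatches());
        System.out.println("lone surrogate finds: " + loneSurrogateFinds());
    }
}
```

The string holds two 16-bit code units but one code point, which is the
distinction the quoted paragraph turns on.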

(It might have been clearer to you if I'd said '8-bit' and '16-bit'
instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
16-bit encoding of arbitrary *codepoint* sequences.)

Richard.