Unicode Regular Expressions, Surrogate Points and UTF-8

Mon Jun 2 04:29:09 CDT 2014

> \uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect.

In Java (and similar environments), \uXXXX means a char (a UTF-16 code
unit), not a code point. Here is the difference. If you are not used to
Java: string.replaceAll(x,y) uses Java's regex engine to replace every
match of the pattern x with the replacement y in string. Backslashes in
string literals need escaping, so a regex \x has to be written in a
literal as \\x.

    // Patterns: a regex code point escape, two regex code unit escapes,
    // a literal surrogate pair (resolved by the compiler), and any single
    // code point between guillemets.
    String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "«.»"};
    String target =
        "one: «\uD808\uDF45»\t\t" +
        "two: «\uD808\uDF45\uD808\uDF45»\t\t" +
        "lead: «\uD808»\t\t" +
        "trail: «\uDF45»\t\t" +
        "one+: «\uD808\uDF45\uD808»";
    System.out.println("pattern" + "\t→\t" + target + "\n");
    for (String test : tests) {
      System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
    }

*Output:*

    pattern → one: «𒍅» two: «𒍅𒍅» lead: «?» trail: «?» one+: «𒍅?»

    \x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
    \uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
    𒍅 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
    «.» → one: §︎ two: «𒍅𒍅» lead: §︎ trail: §︎ one+: «𒍅?»
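
Because isolated surrogates print as "?" on the console, the same points
can be checked without relying on how the output is rendered. The following
is only a minimal sketch: the class name SurrogateCheck and its two helper
methods are hypothetical and not part of the test above, and the \x{...}
syntax assumes Java 7 or later.

```java
import java.util.regex.Pattern;

public class SurrogateCheck {
    // Hypothetical helper: number of UTF-16 code units (Java chars) in s.
    static int codeUnits(String s) {
        return s.length();
    }

    // Hypothetical helper: number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String pair = "\uD808\uDF45"; // the compiler turns this into two chars
        String lead = "\uD808";       // an isolated lead surrogate

        System.out.println(codeUnits(pair));                          // 2
        System.out.println(codePoints(pair));                         // 1
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 12345

        // Both regex spellings match the whole surrogate pair.
        System.out.println(Pattern.matches("\\x{12345}", pair));
        System.out.println(Pattern.matches("\\uD808\\uDF45", pair));

        // The dot consumes one code point: a full pair, or a lone surrogate.
        System.out.println(Pattern.matches(".", pair));
        System.out.println(Pattern.matches(".", lead));
    }
}
```

The four Pattern.matches calls should all print true, consistent with the
output above.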

The target contains various combinations of code units, to show what
happens. Notice that for matching (e.g. with .), Java treats a lead+trail
pair as a single code point, but it also treats an isolated surrogate char
as a single code point (last line of output). In addition, Java's regex
allows \x{hex} for specifying a code point explicitly. It also has the
syntax \uXXXX (in a literal the \ needs escaping) to specify a code unit;
that is slightly different from the Java compiler's preprocessing of \uXXXX
in source code. Thus, of the four calls below, the first two are equivalent
and replace "{" by "x". The last two are also equivalent, and both fail,
because a single "{" is a broken regex pattern.

    System.out.println("{".replaceAll("\\u007B", "x"));
    System.out.println("{".replaceAll("\\x{7B}", "x"));

    System.out.println("{".replaceAll("\u007B", "x"));
    System.out.println("{".replaceAll("{", "x"));
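
To see all four results in one run, the calls can be wrapped so that a
pattern that fails to compile is reported instead of aborting the program.
This is a sketch under stated assumptions: the class BraceEscapeDemo and
the helper tryReplace are hypothetical names introduced here for
illustration only.

```java
import java.util.regex.PatternSyntaxException;

public class BraceEscapeDemo {
    // Hypothetical helper: returns the replacement result, or a short
    // marker string if the regex does not compile.
    static String tryReplace(String input, String regex, String replacement) {
        try {
            return input.replaceAll(regex, replacement);
        } catch (PatternSyntaxException e) {
            return "PatternSyntaxException";
        }
    }

    public static void main(String[] args) {
        // The regex engine sees an escape and resolves it to "{" itself.
        System.out.println(tryReplace("{", "\\u007B", "x"));
        System.out.println(tryReplace("{", "\\x{7B}", "x"));

        // The compiler has already produced a bare "{" before the regex
        // engine runs, so both patterns are the same broken regex.
        System.out.println(tryReplace("{", "\u007B", "x"));
        System.out.println(tryReplace("{", "{", "x"));
    }
}
```

The first two lines print x; the last two print the marker, because the
regex engine receives an unescaped "{" in both cases.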

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Sun, 1 Jun 2014 08:58:26 -0700
> Markus Scherer <markus.icu at gmail.com> wrote:
>
> > You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
> > supplementary code point, but as long as you have a surrogate pair,
> > it is treated as a code point in APIs that support them.
>
> Wasn't obvious that in the following paragraph \uD808\uDF45 was a
> pattern?
>
> "Bear in mind that a pattern \uD808 shall not match anything in a
> well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
> codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
> string and before Unicode 5.2 could readily be taken to occur in an
> ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
> expression engine, the codepoint sequence <U+D808, U+DF45> cannot
> occur in a UTF-16 Unicode string; instead, the code unit sequence <D808
> DF45> is the codepoint sequence <U+12345 CUNEIFORM SIGN URU TIMES
> KI>."
>
> (It might have been clearer to you if I'd said '8-bit' and '16-bit'
> instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
> 16-bit encoding of arbitrary *codepoint* sequences.)
>
> Richard.