Unicode Regular Expressions, Surrogate Points and UTF-8

Philippe Verdy verdy_p at wanadoo.fr
Wed Jun 4 03:10:52 CDT 2014


It does match in a 16-bit "Unicode" string, but this is not a "UTF-16"
string: there is no such thing as a "16-bit string" in Unicode unless you
specify which exact UTF encoding form defined in the standard is meant.

- the Java regex "\\x{0020}" (here in Java source literal String format,
which requires escaping the backslash of the regexp) is not
contextual: it matches exactly one 16-bit char '\u0020' independently of
its context.

- the Java regex "\\x{DC00}" (here in Java source literal String
format) is contextual: it really matches one 16-bit char '\uDC00' either at
the *start* of the String or NOT immediately preceded by a 16-bit char
between '\uD800' and '\uDBFF'.
- the Java regex "\\uDC00" (here in Java source literal String format) is
NOT contextual: it really matches one 16-bit char '\uDC00' in all contexts,
so it is the same as the Java regexp "\uDC00" (because this single
surrogate char has no "special" meaning in regexps and is interpreted
literally by the regexp engine).

- the Java regex "\\x{D808}" (here in Java source literal String
format) is contextual: it really matches one 16-bit char '\uD808' either at
the *end* of the String or NOT immediately followed by a 16-bit char between
'\uDC00' and '\uDFFF'.
- the Java regex "\\uD808" (here in Java source literal String format) is
NOT contextual: it really matches one 16-bit char '\uD808' in all contexts,
so it is the same as the Java regexp "\uD808" (because this single
surrogate char has no "special" meaning in regexps and is interpreted
literally by the regexp engine).
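
The high-surrogate case with the \x{...} notation can be checked directly.
The following is only a minimal sketch: the class and method names are made
up for illustration, and the inputs are the four checks quoted further down
in this thread, in the same order. It does not try to settle the other cases
listed above.

```java
import java.util.regex.Pattern;

// Illustrative sketch; class and method names are hypothetical.
final class LoneHighSurrogateDemo {

    // Returns true if the regex (given as a Java String) is found anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String pair = "\uD808\uDF45";    // one surrogate pair (U+12345)
        String split = "\uD808_\uDF45";  // the same two chars separated by '_'

        // Two separate regex-level hex escapes against the surrogate pair.
        System.out.println(found("\\x{D808}\\x{DF45}", pair));
        // The pair written with source-level escapes: the regex holds the two chars.
        System.out.println(found("\uD808\uDF45", pair));
        // Regex-level escape; in the input the high surrogate is followed by a low one.
        System.out.println(found("\\x{D808}", pair));
        // Same regex; in the input the high surrogate is followed by '_'.
        System.out.println(found("\\x{D808}", split));
    }
}
```

The results reported in the quoted message below for these four checks are
false, true, false, true.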

In summary, the regexp engine in Java does not really work with code
points; it works directly at the code unit level. The \x notation is a
convenient shortcut to specify contexts for literal code units, or to
escape the special meaning of some regexp operators.

Another example: the Java regexp "A*" is identical to
"\u0041\u002A"; in both cases this means zero or more occurrences of Latin
capital letter A.

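A small sketch of that difference, with hypothetical class and constant
names: the first constant uses source-level escapes, which the compiler
replaces before the regexp engine ever sees the String, while the second
uses doubled backslashes so that the escapes reach the regexp engine.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and constant names are hypothetical.
final class SourceEscapeVsRegexEscape {

    // Source-level escapes: the compiler turns this into the two chars A and *.
    static final String SOURCE_LEVEL = "\u0041\u002A";

    // Doubled backslashes: the regexp engine itself receives the two escapes.
    static final String REGEX_LEVEL = "\\u0041\\u002A";

    public static void main(String[] args) {
        // Same content as the plain literal.
        System.out.println(SOURCE_LEVEL.equals("A*"));             // true
        // The star is still a quantifier: zero or more A.
        System.out.println(Pattern.matches(SOURCE_LEVEL, "AAA"));  // true
        System.out.println(Pattern.matches(SOURCE_LEVEL, ""));     // true
        // With regex-level escapes the star is a literal character.
        System.out.println(Pattern.matches(REGEX_LEVEL, "A*"));    // true
        System.out.println(Pattern.matches(REGEX_LEVEL, "AAA"));   // false
    }
}
```
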
The \u notation in Java source code does not escape the special meaning
for regexps at runtime; it is a convenience only for the source code, for
example to write a character that cannot be typed or stored directly in
the source file. Java source code files may be encoded in any text encoding
supported by the internationalisation library accessible to the Java
compiler. For example, the source could use only US-ASCII or Windows-1252,
and then there is no other way than the \u notation to compile a 16-bit
char code unit into a String literal if the needed character is absent from
the source encoding. Java source code may also be encoded in UTF-8, in
which case most uses of \u are not needed. In Java you can also use the
\u notation for identifiers, or for operators of the language!
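
As a minimal sketch of that last point (the class, method and variable
names are made up for illustration), the following compiles even though part
of an identifier and one operator are written as escapes:

```java
// Illustrative sketch; the class, method and variable names are hypothetical.
final class EscapedIdentifiers {

    static int compute() {
        int \u0061bc = 40;       // declares the variable abc (U+0061 is 'a')
        int sum = abc \u002B 2;  // U+002B is the + operator
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(compute()); // prints 42
    }
}
```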

The \u notation in Java source code is in fact interpreted AFTER the source
file has been decoded by the source code reader according to its specified
source encoding. The decoded source text (internally represented in a Java
16-bit char[] array) is then processed by the input stage of the lexer,
which converts these \u notations prior to recognizing the lexical items.
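
Because this conversion happens before any token is recognized, even the
delimiters of a String literal can be written as escapes. A minimal sketch,
with a hypothetical class name:

```java
// Illustrative sketch; the class and constant names are hypothetical.
final class EscapesBeforeLexing {

    // Both string delimiters are written as escapes (U+0022 is the double quote).
    // After conversion the lexer sees an ordinary five-letter String literal.
    static final String GREETING = \u0022hello\u0022;

    public static void main(String[] args) {
        System.out.println(GREETING);          // prints hello
        System.out.println(GREETING.length()); // prints 5
    }
}
```

For the same reason, such an escape placed inside a String literal ends the
literal instead of adding a quote character to it.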

There are quite similar input stages in ANSI C/C++ compilers.

For example, ANSI C has long supported the "??" trigraph prefix for writing
some standard operators or delimiters of the language when the characters
needed by its syntax are not supported in the source code encoding. This
input stage also occurs prior to recognizing the lexical entities of the
language. It was used when the input encoding did not support the full
US-ASCII character set but only the invariant subset of ISO 646, such as
old national versions of 7-bit EBCDIC or even the older 5-bit or 6-bit
encodings like Baudot. Very few C programmers know of the existence of this
notation in ANSI C, because today they only write code in files stored in an
encoding supporting at least the full US-ASCII set (including one of the
many 8-bit EBCDIC variants remaining on mainframes). It still matters when
working on source code via old "exotic" 7-bit terminals, or when a national
keyboard does not define a way to enter the full US-ASCII graphic set, such
as braces or backslashes...



2014-06-04 1:40 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Tue, 03 Jun 2014 15:06:30 -0700
> Xueming Shen <xueming.shen at oracle.com> wrote:
>
> > On 06/02/2014 01:01 PM, Richard Wordingham wrote:
> > > On Mon, 2 Jun 2014 11:29:09 +0200
> > > Mark Davis ☕️<mark at macchiato.com>  wrote:
> > >
> > >>> \uD808\uDF45 specifies a sequence of two codepoints.
> > >> That is simply incorrect.
> > > The above is in the sample notation of UTS #18 Version 17 Section
> > > 1.1.
> > >
> > >  From what I can make out, the corresponding Java notation would be
> > > \x{D808}\x{DF45}.  I don't *know* what \x{D808} and \x{DF45} match
> > > in Java, or whether they are even acceptable.  The only thing UTS
> > > #18 RL1.7 permits them to match in Java is lone surrogates, but I
> > > don't know if Java complies.
> >
> > The notation for "\uD808\uDF45" is interpreted as a supplementary
> > codepoint and is represented internally as a pair of surrogates in
> > String.
> >
> >    Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find();
> >    -> false
> >    Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find();
> >    -> true
> >    Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find();
> >    -> false
> >    Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find();
> >    -> true
>
> Thank you for providing examples confirming that what in the UTS #18
> *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45}
> in Java notation, matches nothing in any 16-bit Unicode string.
>
> Richard.