Unicode Regular Expressions, Surrogate Points and UTF-8

Markus Scherer markus.icu at gmail.com
Sat May 31 21:28:27 CDT 2014


On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Bear in mind that a pattern \uD808 shall not match anything in a
> well-formed Unicode string.


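That depends on the kind of string. See the definitions of Unicode strings
vs. UTF strings in the Unicode Standard: a Unicode 16-bit string may contain
unpaired surrogates, while a well-formed UTF-16 string may not.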

> \uD808\uDF45 specifies a sequence of two
> codepoints.


Implementations that use Unicode 16-bit strings will usually treat this as
one supplementary code point.
In Java string literals, there is no other way to escape one.
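
To make this concrete, here is a minimal sketch of how java.util.regex
handles the pair escape. The class and method names are made up for
illustration, and the behaviour described assumes Java 7 or later, where the
regex syntax (though not the Java string literal) also accepts \x{12345} as
an alternative spelling.

```java
import java.util.regex.Pattern;

class SurrogateEscapeDemo {
    // U+12345, written in the Java literal as a surrogate pair of two escapes.
    static final String CUNEIFORM = "\uD808\uDF45";

    // Regex-level escapes: the doubled backslashes mean the regex parser,
    // not the Java compiler, reads the two escape sequences.
    static final String PAIR_ESCAPE = "\\uD808\\uDF45";

    // Whole-input match of the pair escape against the single code point.
    static boolean pairMatches() {
        return Pattern.matches(PAIR_ESCAPE, CUNEIFORM);
    }

    // A bracket expression consumes one code point, so a whole-input match
    // succeeds only if the two escapes were combined into one class member.
    static boolean pairIsOneClassMember() {
        return Pattern.matches("[" + PAIR_ESCAPE + "]", CUNEIFORM);
    }

    // Alternative spelling that exists only in the regex syntax (Java 7+).
    static boolean hexEscapeMatches() {
        return Pattern.matches("\\x{12345}", CUNEIFORM);
    }

    public static void main(String[] args) {
        System.out.println("UTF-16 code units: " + CUNEIFORM.length());
        System.out.println("code points: "
                + CUNEIFORM.codePointCount(0, CUNEIFORM.length()));
        System.out.println("pair escape matches: " + pairMatches());
        System.out.println("pair is one class member: " + pairIsOneClassMember());
        System.out.println("hex escape matches: " + hexEscapeMatches());
    }
}
```

The bracket-expression check is the interesting one: if the parser kept the
two escapes as separate surrogate code points, the class could not consume
the whole two-unit input.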

markus