Corrigendum #9

Philippe Verdy verdy_p at
Tue Jun 3 09:20:35 CDT 2014

I think his point is that an application may want to encapsulate, in valid
text, any arbitrary stream of code points (including noncharacters, PUAs,
or isolated surrogate code units found in 16-bit or 32-bit streams that are
invalid UTF-16 or UTF-32 streams, or even arbitrary invalid 8-bit bytes in
streams that are not valid UTF-8).

For 8-bit streams, using ESC or \ is generally a good choice of escape to
derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs
are more economical (but PUA code units found in the stream still need to
be escaped).
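
To make the 8-bit case concrete, here is a minimal sketch, not a definitive
implementation. It assumes a hypothetical escape syntax of my own choosing:
"\\" for a literal backslash and "\xHH" for each byte that is not part of a
well-formed UTF-8 sequence. The class and method names are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ByteEscaper {
    // Hypothetical escape syntax (an assumption, not a standard):
    // a backslash is doubled, and every byte that cannot be decoded
    // as UTF-8 is written as backslash, 'x', two hex digits.
    static String escape(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(input);
        // UTF-8 never yields more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(input.length + 1);
        StringBuilder sb = new StringBuilder();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            // Copy whatever was decoded so far, doubling backslashes.
            out.flip();
            while (out.hasRemaining()) {
                char c = out.get();
                if (c == '\\') {
                    sb.append("\\\\");
                } else {
                    sb.append(c);
                }
            }
            out.clear();
            if (result.isError()) {
                // The buffer is positioned at the first bad byte;
                // consume and escape the whole malformed run.
                for (int k = 0; k < result.length(); k++) {
                    sb.append(String.format("\\x%02X", in.get() & 0xFF));
                }
            } else if (result.isUnderflow()) {
                break; // all input consumed
            }
        }
        return sb.toString();
    }
}
```

The result is always a valid Java string without surrogates introduced by
the escaping itself, so it can be re-encoded as valid UTF-8. A matching
unescape step would reverse the two rules to recover the original bytes.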

If you think about the Java regexp "\\uD800", it does not designate a code
point but only a code unit, which is not valid plain text alone as it
violates UTF-16 encoding rules. Trying to match it in a valid UTF-16 stream
can work only if you can represent isolated code units for a specific
encoding like UTF-16, even if the target stream to look for this match uses
any other valid UTF (not necessarily UTF-16: decode the target text,
reencode it to UTF-16 to generate a 16-bit stream in which you'll look for
isolated 16-bit code units with the regexp).

So yes, the regexp "\\uXXXX" (in Java source) is not used to match a single
valid character.
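
As a rough illustration of looking for isolated 16-bit code units, here is a
small sketch with hypothetical helper names. findIsolated scans a Java
CharSequence (a 16-bit stream) by hand and reports every surrogate that is
not part of a well-formed pair; containsD800 applies the regexp discussed
above. I only rely on the regexp finding a D800 that stands alone; how it
treats the high half of a well-formed pair is not something this sketch
asserts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class LoneSurrogates {
    // Returns the indices of 16-bit code units that are surrogates
    // but do not belong to a well-formed high+low pair.
    static List<Integer> findIsolated(CharSequence s) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip its low half
                } else {
                    hits.add(i); // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                hits.add(i); // low surrogate with no preceding high
            }
        }
        return hits;
    }

    // The regexp from the discussion: it names the code unit D800,
    // not a valid character.
    static boolean containsD800(String s) {
        return Pattern.compile("\\uD800").matcher(s).find();
    }
}
```

For a string made of 'a', the lone code unit D800, and 'b', findIsolated
reports index 1 and containsD800 returns true; for a string holding only a
well-formed pair, findIsolated reports nothing.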

2014-06-03 8:21 GMT+02:00 David Starner <prosfilaes at>:

> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham
> <richard.wordingham at> wrote:
> > Much as I don't like their uninvited use, it is possible to pass them
> > and other undesirables through most applications by a slight bit of
> > recoding at the application's boundaries.  Using 99 = (3 + 32 + 64) PUA
> > characters, one can ape UTF-16 surrogates and encode:
> What's the point? If we can use the PUA, then we don't need the
> noncharacters; we can just use the PUA directly. If we have to play
> around with remapping them, they're pointless; they're no easier to
> use in that case than ESC or '\' or PUA characters.
> --
> Kie ekzistas vivo, ekzistas espero.
> _______________________________________________
> Unicode mailing list
> Unicode at