Surrogates and noncharacters

Philippe Verdy verdy_p at
Sat May 9 09:54:30 CDT 2015

2015-05-09 16:26 GMT+02:00 Richard Wordingham <
richard.wordingham at>:

> In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and
> <0054, D800, DCC1>, are UTF-16 strings.

Again you use "Unicode strings" for your 6 permutations, but in your
example nothing makes them "Unicode strings", given that you allow
arbitrary code units in arbitrary order, including unpaired ones. The 6
permutations are just "16-bit strings" (adding "Unicode" to the name of
these 6 permutations gives absolutely no value if you keep your definition,
but visibly it cannot fit with the term used in the RFC trying to normalize
JSON, which has similar confusions!).

TUS does not define a "Unicode string" the way you do here.
TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16
(which implies conformance and only accepts the latter two strings, which TUS
names "Unicode 16-bit strings", not "UTF-16 strings"...)
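To make the example under discussion concrete, here is a minimal Python
sketch that enumerates the 6 permutations of <D800, 0054, DCC1> and keeps
only those in which every surrogate code unit is correctly paired. The
function name is_well_formed_utf16 and the variable names are illustrative
assumptions, not terminology from TUS, and the check only looks at surrogate
pairing, nothing else.

```python
from itertools import permutations


def is_well_formed_utf16(units):
    """Return True if every surrogate code unit belongs to a high+low pair.

    Hypothetical helper for illustration: it checks surrogate pairing only.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be immediately followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate not preceded by a high surrogate is unpaired.
            return False
        i += 1
    return True


UNITS = (0xD800, 0x0054, 0xDCC1)
well_formed = [p for p in permutations(UNITS) if is_well_formed_utf16(p)]

for p in well_formed:
    print("<" + ", ".join(f"{u:04X}" for u in p) + ">")
```

Under these assumptions only <D800, DCC1, 0054> and <0054, D800, DCC1> pass
the check; the other four permutations contain at least one unpaired
surrogate.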

TUS goes further by then distinguishing its encoding schemes (taking into
account their serialization to 8-bit streams, and also considering the byte
order, for defining the 3 supported UTF-16 encoding schemes: with or
without BOM): then a "UTF-16 string" becomes "UTF-16 encoded text" (or
UTF-16BE or UTF-16LE).
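As a rough illustration of the step from code units to bytes, the sketch
below serializes the same code units <D800, DCC1, 0054> (that is, U+100C1
followed by "T") with Python's standard "utf-16-be" and "utf-16-le" codecs,
then prepends a BOM by hand to show the BOM-marked form. The variable names
are assumptions chosen for this example.

```python
import codecs

# <D800, DCC1> is the surrogate pair for U+100C1; 0054 is "T".
text = "\U000100C1T"

be = text.encode("utf-16-be")  # big-endian, no BOM
le = text.encode("utf-16-le")  # little-endian, no BOM

# With a BOM in front, a decoder can determine the byte order itself.
with_bom = codecs.BOM_UTF16_BE + be

print(be.hex(" "))        # d8 00 dc c1 00 54
print(le.hex(" "))        # 00 d8 c1 dc 54 00
print(with_bom.hex(" "))  # fe ff d8 00 dc c1 00 54

# Python's generic "utf-16" decoder reads the BOM and strips it.
decoded = with_bom.decode("utf-16")
```

The same sequence of 16-bit code units therefore yields different byte
sequences depending on the byte order and on whether a BOM is present.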

Note also that I used the term "stream" instead of "string" only to avoid
restricting the length (but JSON does not support encoding streams of
arbitrary length: all of them must have a start, an end, and a defined
bounded length, while streams don't necessarily have any defined length
property, independently of the way we measure length: in bytes, code
units, code points, combining sequences, grapheme clusters...).