Surrogates and noncharacters

Richard Wordingham richard.wordingham at ntlworld.com
Sat May 9 10:51:21 CDT 2015


On Sat, 9 May 2015 16:54:30 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-09 16:26 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> > are Unicode strings, but that only two, namely <D800, DCC1, 0054>
> > and <0054, D800, DCC1>, are UTF-16 strings.
> >
> 
> Again you use "Unicode strings" for your 6 permutations, but in your
> example they have nothing that makes them "Unicode strings", given you
> allow arbitrary code units in arbitrary order, including unpaired
> ones. The 6 permutations are just "16-bit strings" (adding "Unicode"
> for these 6 permutations gives absolutely no value if you keep your
> definition, but visibly it cannot fit with the term used in the RFC
> trying to normalize JSON, with similar confusions!).

> TUS does not define what is a "Unicode string" like you do here.

D80    _Unicode string:_  A code unit sequence containing code units of
a particular Unicode encoding form.

RW: Note that by this definition, a permutation of a Unicode string is
a Unicode string.

D82    _Unicode 16-bit string:_  A Unicode string containing only UTF-16
code units.

D85    _Well-formed:_  A Unicode code unit sequence that purports to be
in a Unicode encoding form is called well-formed if and only if it
_does_ follow the specification of that Unicode encoding form.

D89    _In a Unicode encoding form:_ A Unicode string is said to be in
a particular Unicode encoding form if and only if it consists of a
well-formed Unicode code unit sequence of that Unicode encoding form.
•   A Unicode string consisting of a well-formed UTF-8 code unit
sequence is said to be _in UTF-8_. Such a Unicode string is referred to
as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
•   A Unicode string consisting of a well-formed UTF-16 code unit
sequence is said to be _in UTF-16_. Such a Unicode string is referred to
as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
•   A Unicode string consisting of a well-formed UTF-32 code unit
sequence is said to be _in UTF-32_. Such a Unicode string is referred to
as a _valid UTF-32 string_, or a _UTF-32 string_ for short.

> TUS just defines "Unicode 16-bit strings" with a direct reference to
> UTF-16 (which implies conformance and only accepts the latter two
> strings, that TUS names "Unicode 16-bit strings", not "UTF-16
> strings"...)

Look at D82 again.  It refers to UTF-16 code units and does not
otherwise reference UTF-16.
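
To make the distinction concrete, here is a minimal Python sketch, not
taken from TUS, that tests the six permutations of <D800, 0054, DCC1>
for UTF-16 well-formedness.  The function name is_well_formed_utf16 is
hypothetical, and the sketch assumes its input is a sequence of
integers in the range 0..FFFF, i.e. 16-bit code units.

```python
from itertools import permutations


def is_well_formed_utf16(units):
    """Return True if the 16-bit code units form a well-formed UTF-16 sequence.

    Hypothetical helper: assumes every element is an int in 0..0xFFFF.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed immediately by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that is not preceded by a high surrogate.
            return False
        else:
            # Ordinary BMP code unit.
            i += 1
    return True


units = (0xD800, 0x0054, 0xDCC1)
all_permutations = list(permutations(units))
well_formed = [p for p in all_permutations if is_well_formed_utf16(p)]

for p in all_permutations:
    label = "UTF-16 string" if p in well_formed else "not well-formed"
    print(" ".join(f"{u:04X}" for u in p), "->", label)
```

Each permutation consists solely of 16-bit code units, so each one
meets D82; only the two sequences in which D800 is immediately
followed by DCC1 pass the well-formedness check.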

If you still do not believe me, consider D89.  Can you think of an
example of a Unicode string consisting of UTF-8 code units, UTF-16
code units or UTF-32 code units that is not a UTF-8 string, not a
UTF-16 string and not a UTF-32 string?  If you can't, the use of
"well-formed" is curiously redundant in D89.

Richard.


