Surrogates and noncharacters
richard.wordingham at ntlworld.com
Sat May 9 10:51:21 CDT 2015
On Sat, 9 May 2015 16:54:30 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-05-09 16:26 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> > are Unicode strings, but that only two, namely <D800, DCC1, 0054>
> > and <0054, D800, DCC1>, are UTF-16 strings.
> Again you use "Unicode strings" for your 6 permutations, but in your
> example they have nothing that makes them "Unicode strings", given you
> allow arbitrary code units in arbitrary order, including unpaired
> ones. The 6 permutations are just "16-bit strings" (adding "Unicode"
> for these 6 permutations gives absolutely no value if you keep your
> definition, but visibly it cannot fit with the term used in the RFC
> trying to normalize JSON, with similar confusions !).
> TUS does not define what is a "Unicode string" like you do here.
D80 _Unicode string:_ A code unit sequence containing code units of
a particular Unicode encoding form
RW: Note that by this definition, a permutation of a Unicode string is
a Unicode string.
D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16
code units.
D85 _Well-formed:_ A Unicode code unit sequence that purports to be
in a Unicode encoding form is called well-formed if and only if it
_does_ follow the specification of that Unicode encoding form
D89 _In a Unicode encoding form:_ A Unicode string is said to be in
a particular Unicode encoding form if and only if it consists of a
well-formed Unicode code unit sequence of that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit
sequence is said to be _in UTF-8_. Such a Unicode string is referred to
as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
• A Unicode string consisting of a well-formed UTF-16 code unit
sequence is said to be _in UTF-16_. Such a Unicode string is referred to
as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
• A Unicode string consisting of a well-formed UTF-32 code unit
sequence is said to be _in UTF-32_. Such a Unicode string is referred to
as a _valid UTF-32 string_, or a _UTF-32 string_ for short.
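The gap between D80/D82 and D89 can be checked mechanically for the
<D800, 0054, DCC1> example. What follows is only a minimal sketch, assuming
code units are modelled as plain Python integers; the helper name
is_well_formed_utf16 is hypothetical and is not taken from TUS or any
library. It treats a sequence as well-formed UTF-16 when every surrogate
code unit belongs to a high-then-low pair.

```python
from itertools import permutations


def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a high+low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is unpaired.
            return False
        i += 1
    return True


# All six permutations are sequences of 16-bit code units (D82);
# only some of them are well-formed UTF-16 (D89).
units = (0xD800, 0x0054, 0xDCC1)
all_perms = list(permutations(units))
well_formed = [p for p in all_perms if is_well_formed_utf16(p)]

for p in well_formed:
    print(" ".join(f"{u:04X}" for u in p))
```

Of the six permutations, only <D800, DCC1, 0054> and <0054, D800, DCC1>
pass the check; the other four contain an unpaired surrogate.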
> TUS just defines "Unicode 16-bit strings" with a direct reference to
> UTF-16 (which implies conformance and only accepts the later two
> strings, that TUS names "Unicode 16-bit strings", not "UTF-16
Look at D82 again. It refers to UTF-16 code units and does not
otherwise reference UTF-16.
If you still do not believe me, consider D89. Can you think of an
example of a Unicode string consisting of UTF-8 code units, UTF-16
code units or UTF-32 code units that is not a UTF-8 string, not a
UTF-16 string and not a UTF-32 string? If you can't, the use of
"well-formed" is curiously redundant in D89.
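Under the literal reading of D80, such examples are easy to construct
for UTF-8 as well: permute the code units of a UTF-8 string. A minimal
sketch, assuming Python's strict built-in "utf-8" codec is an acceptable
stand-in for the well-formedness test; is_utf8_string is a hypothetical
helper name.

```python
from itertools import permutations


def is_utf8_string(code_units: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 code unit sequence."""
    try:
        code_units.decode("utf-8")  # the default error handler is 'strict'
        return True
    except UnicodeDecodeError:
        return False


# U+20AC EURO SIGN is <E2 82 AC> in UTF-8.
euro = "\u20ac".encode("utf-8")

# Every permutation is still a sequence of UTF-8 code units,
# but most of them are not well-formed.
ill_formed = [bytes(p) for p in permutations(euro)
              if not is_utf8_string(bytes(p))]

for seq in ill_formed:
    print(" ".join(f"{b:02X}" for b in seq))
```

Four of the six permutations begin with a trailing byte (82 or AC) and
are rejected; <E2 82 AC> and <E2 AC 82> are both well-formed.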