Surrogates and noncharacters
verdy_p at wanadoo.fr
Sun May 10 00:42:14 CDT 2015
OK, but D80 and D82 have no purpose, except adding the term "Unicode"
redundantly to these expressions.
- D80 defines "Unicode string" but in fact it just defines a generic
"string" as an arbitrary stream of fixed-size code units. This is the basic
definition applicable to all languages I've seen (even if they add
additional properties or methods in OOP). It is the same as a C/C++ string
(if we ignore the additional convention of using null as a terminator,
something that is not required by the language, but only a convention of its
oldest standard libraries; newer libraries encode the length separately).
- D82 defines "Unicode 16-bit string" but in fact it just defines a generic
"16-bit string" as an arbitrary stream of 16-bit code units. This is
not requiring the null-byte termination but storing the length as an
These two rules are not productive at all, except for saying that all
values of fixed size code units are acceptable (including for example 0xFF
in 8-bit strings, which is invalid in UTF-8)
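To make the 0xFF point concrete, here is a minimal Python sketch. The helper name is_well_formed_utf8 is only illustrative (it is not a TUS term or a library function); it relies on Python's built-in strict UTF-8 decoder to decide well-formedness. Any list of 8-bit values is accepted as a code unit sequence, but only some of them decode as UTF-8.

```python
def is_well_formed_utf8(code_units):
    """Return True if a sequence of 8-bit code units is well-formed UTF-8.

    Illustrative helper: it delegates the check to Python's strict decoder.
    """
    try:
        bytes(code_units).decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# An arbitrary sequence of 8-bit code units; 0xFF never occurs in UTF-8.
any_8bit_string = [0x41, 0xFF, 0x42]
# The UTF-8 encoding of "A" followed by U+00E9.
utf8_string = [0x41, 0xC3, 0xA9]

print(is_well_formed_utf8(any_8bit_string))  # False
print(is_well_formed_utf8(utf8_string))      # True
```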
Curiously D80 and D82 just restrict themselves to bounded strings (with a
defined length), instead of streams (with undetermined length, no start
index, no absolute position, no terminator, but just a special distinct
value returned for EOF or a method to query the current termination state
of the stream, which may be time-dependent).
However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF>
a valid "32-bit string"? After all it is also containing a single 32-bit
code unit (for at least one Unicode encoding form), even if it has no
"scalar value" and then does not have to validate D89 (for UTF-32)...
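The D89 side of that question can be sketched in Python as below. The function names are hypothetical, and the sketch assumes that a well-formed UTF-32 sequence is one in which every 32-bit code unit is a Unicode scalar value (0..0x10FFFF, excluding the surrogate range D800..DFFF). It does not settle whether <0xFFFFFFFF> counts as a "32-bit string" under D80; it only shows that such a sequence is not in UTF-32.

```python
def is_scalar_value(code_unit):
    """True if the value is a Unicode scalar value (no surrogates, <= 0x10FFFF)."""
    return 0 <= code_unit <= 0x10FFFF and not (0xD800 <= code_unit <= 0xDFFF)


def is_well_formed_utf32(code_units):
    """True if every 32-bit code unit in the sequence is a scalar value."""
    return all(is_scalar_value(cu) for cu in code_units)


print(is_well_formed_utf32([0xFFFFFFFF]))  # False: fits in 32 bits, not a scalar value
print(is_well_formed_utf32([0x0000D800]))  # False: a surrogate code point
print(is_well_formed_utf32([0x0010FFFF]))  # True
```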
If there are confusions in other documents, it's now probably because of
the completely unproductive D80 and D82 definitions of specific terms
(which are probably not definitions of terms but just fixing the needed
local context in order to define D89). The two rules D80 and D82 have
absolutely no use in TUS outside D89. So D80 and D82 are probably excessive
definitions; D89 would be enough (TUS should not have to dictate other
lower-level behavior to programming environments or protocols).
2015-05-09 17:51 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:
> On Sat, 9 May 2015 16:54:30 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> > 2015-05-09 16:26 GMT+02:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
> > > In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
> > > are Unicode strings, but that only two, namely <D800, DCC1, 0054>
> > > and <0054, D800, DCC1>, are UTF-16 strings.
> > >
> > Again you use "Unicode strings" for your 6 permutations, but in your
> > example they have nothing that make them "Unicode strings", given you
> > allow arbitrary code units in arbitrary order, including unpaired
> > ones. The 6 permutations are just "16-bit strings" (adding "Unicode"
> > for these 6 permutations gives absolutely no value if you keep your
> > definition, but visibly it cannot fit with the term used in the RFC
> > trying to normalize JSON, with similar confusions !).
> > TUS does not define what is a "Unicode string" like you do here.
> D80 _Unicode string:_ A code unit sequence containing code units of
> a particular Unicode encoding form
> RW: Note that by this definition, a permutation of a Unicode string is
> a Unicode string.
> D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16
> code units.
> D85 _Well-formed:_ A Unicode code unit sequence that purports to be
> in a Unicode encoding form is called well-formed if and only if it
> _does_ follow the specification of that Unicode encoding form
> D89 _In a Unicode encoding form:_ A Unicode string is said to be in
> a particular Unicode encoding form if and only if it consists of a
> well-formed Unicode code unit sequence of that Unicode encoding form.
> • A Unicode string consisting of a well-formed UTF-8 code unit
> sequence is said to be _in UTF-8_. Such a Unicode string is referred to
> as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
> • A Unicode string consisting of a well-formed UTF-16 code unit
> sequence is said to be _in UTF-16_. Such a Unicode string is referred to
> as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
> • A Unicode string consisting of a well-formed UTF-32 code unit
> sequence is said to be _in UTF-32_. Such a Unicode string is referred to
> as a _valid UTF-32 string_, or a _UTF-32 string_ for short.
> > TUS just defines "Unicode 16-bit strings" with a direct reference to
> > UTF-16 (which implies conformance and only accepts the later two
> > strings, that TUS names "Unicode 16-bit strings", not "UTF-16
> > strings"...)
> Look at D82 again. It refers to UTF-16 code units and does not
> otherwise reference UTF-16.
> If you still do not believe me, consider D89. Can you think of an
> example of a Unicode string consisting of UTF-8 code units, UTF-16
> code units or UTF-32 code units that is not a UTF-8 string, not a
> UTF-16 string and is not a UTF-32 string? If you can't, the use of
> "well-formed" is curiously redundant in D89.
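The permutation claim quoted above is easy to check mechanically. The following is a small Python sketch, not anything taken from TUS: is_well_formed_utf16 is a hypothetical helper that accepts a sequence of 16-bit code units only if every high surrogate (D800..DBFF) is immediately followed by a low surrogate (DC00..DFFF) and no low surrogate appears on its own.

```python
from itertools import permutations


def is_well_formed_utf16(code_units):
    """True if the 16-bit code unit sequence has only properly paired surrogates."""
    i, n = 0, len(code_units)
    while i < n:
        cu = code_units[i]
        if 0xD800 <= cu <= 0xDBFF:
            # A high surrogate must be followed at once by a low surrogate.
            if i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= cu <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return False
        i += 1
    return True


units = (0xD800, 0x0054, 0xDCC1)
well_formed = [p for p in permutations(units) if is_well_formed_utf16(p)]

for p in well_formed:
    print(" ".join(f"{cu:04X}" for cu in p))
# D800 DCC1 0054
# 0054 D800 DCC1
```

All six permutations are sequences of 16-bit code units, but only these two pass the well-formedness check.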