Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
richard.wordingham at ntlworld.com
Fri May 8 22:13:52 CDT 2015
On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:
> On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> > Noncharacters are Unicode scalar values,
> (However, noncharacters are not designed to be openly interchanged; see
> "Restricted interchange" on p. 31 of 7.0.0.)
That didn't stop their being openly interchanged.
> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.
> Not sure what you mean by that. So I'll let someone else answer.
There are a number of phrases whose declared meanings cannot be
deduced from the individual words. A UTF-8, UTF-16 or UTF-32 string
defines a sequence of scalar values. However, a Unicode 8-bit, 16-bit
or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit
values that may occur in a UTF-8, UTF-16 or UTF-32 string
respectively. This definition has some odd consequences:
A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a
multi-word encoding. An arbitrary string of unsigned 32-bit values is
not in general a Unicode 32-bit string.
All strings of unsigned 16-bit values are Unicode 16-bit strings. Not
all (Unicode) 16-bit strings are UTF-16 strings.
Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and
not all Unicode 8-bit strings are UTF-8 strings.
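To make these distinctions concrete, here is a minimal sketch in Python.
The function names are hypothetical, chosen only for illustration; they
come from neither the standard nor any library. Strings are modelled as
sequences of integers, and the sketch assumes that the only bytes that
never occur in well-formed UTF-8 are C0, C1 and F5..FF.

```python
# Bytes that never occur in any well-formed UTF-8 string (assumption
# stated above: C0, C1 and F5..FF).
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def is_unicode_8bit_string(units):
    """Each value is a byte that may occur in some UTF-8 string."""
    return all(0 <= u <= 0xFF and u not in NEVER_IN_UTF8 for u in units)


def is_utf8_string(units):
    """The sequence as a whole is well-formed UTF-8."""
    try:
        # Python's strict decoder rejects overlong forms, encoded
        # surrogates and values above U+10FFFF.
        bytes(units).decode("utf-8")
        return True
    except ValueError:  # includes UnicodeDecodeError and out-of-range ints
        return False


def is_unicode_16bit_string(units):
    """Any unsigned 16-bit value may occur in a UTF-16 string."""
    return all(0 <= u <= 0xFFFF for u in units)


def is_utf16_string(units):
    """Surrogates must appear only as high-then-low pairs."""
    it = iter(units)
    for u in it:
        if not 0 <= u <= 0xFFFF:
            return False
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no preceding high
            return False
        if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one next
            nxt = next(it, None)
            if nxt is None or not 0xDC00 <= nxt <= 0xDFFF:
                return False
    return True


def is_unicode_32bit_string(units):
    """Each value is a scalar value; this is also the UTF-32 test."""
    return all(
        0 <= u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF for u in units
    )
```

For example, [0x80, 0x80] passes is_unicode_8bit_string but fails
is_utf8_string, and [0xD800] passes is_unicode_16bit_string but fails
both is_utf16_string and is_unicode_32bit_string. A noncharacter such as
0xFFFF passes all of the 16-bit and 32-bit checks, since it is a scalar
value.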
I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.