Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

Fri May 8 22:13:52 CDT 2015

On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:

> On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> > Noncharacters are Unicode scalar values,

> (However noncharacters are not designed to be openly interchanged see
> "Restricted interchange" on p. 31. of 7.0.0)

That didn't stop their being openly interchanged.

> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.

> Not sure what you mean by that. So I let someone else answer.  

There are a number of phrases whose declared meanings cannot be
deduced from the individual words.  A UTF-8, UTF-16 or UTF-32 string
defines a sequence of scalar values.  However, a Unicode 8-bit, 16-bit
or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit
values, each of which may occur in a UTF-8, UTF-16 or UTF-32 string
respectively.  This definition has some odd consequences:

A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a
multi-word encoding.  An arbitrary string of unsigned 32-bit values is
not in general a Unicode 32-bit string.

All strings of unsigned 16-bit values are Unicode 16-bit strings.  Not
all (Unicode) 16-bit strings are UTF-16 strings.

Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and
not all Unicode 8-bit strings are UTF-8 strings.
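
To make the distinctions above concrete, here is a rough Python sketch.
It follows the definitions as read in this message, not an official
API; all function names are made up for illustration, and the set of
bytes treated as "never occurring in UTF-8" (C0, C1, F5..FF) is my
assumption about what the definition excludes.

```python
# Sketch only: classifies sequences of integers under the reading of the
# definitions given in this message.  All names here are hypothetical.

# Assumption: byte values that cannot occur anywhere in well-formed UTF-8.
NEVER_IN_UTF8 = {0xC0, 0xC1} | set(range(0xF5, 0x100))


def is_unicode_8bit_string(units):
    """Every value is a byte that can occur in some UTF-8 string."""
    return all(0 <= u <= 0xFF and u not in NEVER_IN_UTF8 for u in units)


def is_utf8_string(units):
    """The whole sequence is well-formed UTF-8 (strict decoding)."""
    try:
        bytes(units).decode("utf-8")
        return True
    except ValueError:  # also covers UnicodeDecodeError
        return False


def is_unicode_16bit_string(units):
    """Any unsigned 16-bit value can occur in some UTF-16 string."""
    return all(0 <= u <= 0xFFFF for u in units)


def is_utf16_string(units):
    """Surrogates must come in (high, low) pairs."""
    units = list(units)
    i = 0
    while i < len(units):
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            return False
        if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one next
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:  # low surrogate with no high before it
            return False
        else:
            i += 1
    return True


def is_utf32_string(units):
    """Every value is a scalar value: at most 10FFFF and not a surrogate."""
    return all(
        0 <= u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF for u in units
    )


# Under the reading above, the two 32-bit notions coincide.
is_unicode_32bit_string = is_utf32_string
```

With these, the single byte 80 passes is_unicode_8bit_string but fails
is_utf8_string, a lone D800 passes is_unicode_16bit_string but fails
is_utf16_string, and the noncharacter FFFF passes every check for its
width, since noncharacters are scalar values.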

I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings.  'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.

Richard.