Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

Daniel Bünzli daniel.buenzli at erratique.ch
Fri May 8 19:26:59 CDT 2015


On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> Noncharacters are Unicode scalar values,

Noncharacters are Unicode scalar values, by definitions D14 and D76.
  
> while unpaired surrogates are not.

No surrogate code point is a Unicode scalar value, by D71, D73 and D76.
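
To make the three categories concrete, here is a minimal sketch in Python of how I read D14, D71, D73 and D76. The function names are my own and purely illustrative; the ranges are those given by the definitions.

```python
def is_code_point(cp: int) -> bool:
    # The Unicode codespace is U+0000..U+10FFFF.
    return 0 <= cp <= 0x10FFFF

def is_surrogate(cp: int) -> bool:
    # D71 (high surrogates, U+D800..U+DBFF) and D73 (low surrogates, U+DC00..U+DFFF).
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp: int) -> bool:
    # D76: any code point except the surrogate code points.
    return is_code_point(cp) and not is_surrogate(cp)

def is_noncharacter(cp: int) -> bool:
    # D14: U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return is_code_point(cp) and (0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE)
```

With these, is_scalar_value(0xFFFF) is true (a noncharacter is still a scalar value) while is_scalar_value(0xD800) is false.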
  
> This means noncharacters may appear in a well-formed UTF-8, -16, or
> -32 string,

I take "appear" to mean "be encoded". Yes, every Unicode encoding form allows all scalar values to be interchanged, by D79.

(However, noncharacters are not designed to be openly interchanged; see "Restricted interchange" on p. 31 of Unicode 7.0.0.)

> while unpaired surrogates may not.

No surrogate code point, *paired or not*, can be encoded in UTF-{8,16,32}, by D92, D91 and D90. All these encoding forms, by definition, assign only Unicode scalar values to code unit sequences (see also the already mentioned p. 31, which clarifies this).

However, UTF-16 code unit sequences may contain surrogate pairs, i.e. two code units that, taken together, represent a single Unicode scalar value.
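
The distinction can be observed with an actual codec. The following is only a sketch: it uses CPython 3's built-in codecs as a stand-in for the encoding forms and assumes they behave conformantly on this point. The helper try_encode and the variable names are hypothetical.

```python
def try_encode(cp, encoding):
    """Return the bytes encoding code point cp, or None if the codec refuses it."""
    try:
        return chr(cp).encode(encoding)
    except UnicodeEncodeError:
        return None

FORMS = ("utf-8", "utf-16-be", "utf-32-be")

# U+FFFF is a noncharacter but also a scalar value: every form encodes it.
noncharacter = {form: try_encode(0xFFFF, form) for form in FORMS}

# U+D800 is a surrogate code point, not a scalar value: every form rejects it.
lone_surrogate = {form: try_encode(0xD800, form) for form in FORMS}

# U+1F600 is a scalar value above U+FFFF: in UTF-16 it becomes the
# surrogate pair of code units D83D DE00.
pair = chr(0x1F600).encode("utf-16-be")
```

Here every entry of lone_surrogate is None, whereas pair contains the surrogate code units 0xD83D and 0xDE00. The surrogates show up as UTF-16 code units, never as encoded code points.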

> They may both be part of a "Unicode string" which does not claim to be in any given encoding
> form.

I'm not sure what you mean by that, so I'll let someone else answer.

Best,

Daniel  




