Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

Doug Ewell doug at ewellic.org
Fri May 8 17:37:57 CDT 2015


Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

>> Try by yourself, you can perfectly send JSON text containing '\uFFFF'
>> (non-character) or '\uF800' (unpaired surrogate) and I've not seen
>> any JSON implementation complaining about one or the other, when
>> receiving the JSON stream and using it in Javascript, you'll see no
>> missing code unit or replaced code units and no exception as well.
>
> Unicode Consortium standards and recommendations allow non-characters
> to be sent; as far as I can make out, they are just not to be thought
> of as unstandardised graphic characters.

As I understand it, from a purely Unicode standpoint, there are
differences here between noncharacters and unpaired surrogates.

Noncharacters are Unicode scalar values, while unpaired surrogates are
not. This means noncharacters may appear in a well-formed UTF-8, -16, or
-32 string, while unpaired surrogates may not. They may both be part of
a "Unicode string" which does not claim to be in any given encoding
form.
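
To make the distinction concrete, here is a small Python sketch. It is only an illustration under assumptions: the helper names (is_scalar_value, is_noncharacter, encodes_in_all_forms) are made up for this message, and it relies on CPython's strict codecs refusing surrogate code points, which is how they behave in current Python 3 releases as far as I know. A Python str is a sequence of code points, so it plays the role of a "Unicode string" that can hold either kind of code point.

```python
# Hypothetical helpers for illustration; not part of any standard API.

NONCHARACTER = "\uffff"      # U+FFFF, a noncharacter
LONE_SURROGATE = "\ud800"    # U+D800, an unpaired high surrogate


def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: any code point except a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and every U+nFFFE / U+nFFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encodes_in_all_forms(s: str) -> bool:
    """True if s can be encoded as UTF-8, UTF-16 and UTF-32 by the strict codecs."""
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        try:
            s.encode(codec)
        except UnicodeEncodeError:
            # Assumption: the strict error handler rejects surrogate code points.
            return False
    return True


if __name__ == "__main__":
    for label, s in (("noncharacter", NONCHARACTER), ("lone surrogate", LONE_SURROGATE)):
        cp = ord(s)
        print(label, f"U+{cp:04X}",
              "scalar value:", is_scalar_value(cp),
              "encodable:", encodes_in_all_forms(s))
```

Both strings exist happily as Python str objects, but only the one holding the noncharacter survives conversion to an encoding form.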

Authoritative corrections are welcome to help solidify my understanding.

I don't wish to get involved in debates over JSON. I've read RFC 7159
and I know what it says.

--
Doug Ewell | http://ewellic.org | Thornton, CO



