Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
doug at ewellic.org
Fri May 8 17:37:57 CDT 2015
Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>> Try by yourself, you can perfectly send JSON text containing '\uFFFF'
>> (non-character) or '\uF800' (unpaired surrogate) and I've not seen
>> any JSON implementation complaining about one or the other, when
>> missing code unit or replaced code units and no exception as well.
> Unicode Consortium standards and recommendations allow non-characters
> to be sent; as far as I can make out, they are just not to be thought
> of as unstandardised graphic characters.
As I understand it, from a purely Unicode standpoint, there are
differences here between noncharacters and unpaired surrogates.
Noncharacters are Unicode scalar values, while unpaired surrogates are
not. This means noncharacters may appear in a well-formed UTF-8, -16, or
-32 string, while unpaired surrogates may not. They may both be part of
a "Unicode string" which does not claim to be in any given encoding.
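A minimal sketch of that distinction, assuming Python 3, whose str type is a sequence of code points and so can hold a lone surrogate, roughly like a string that claims no particular encoding. The helper names (is_surrogate, is_noncharacter, is_scalar_value, encodes_as_utf8) are made up for illustration and are not taken from any library:

```python
def is_surrogate(cp: int) -> bool:
    # Surrogate code points occupy U+D800..U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two
    # code points of each of the 17 planes (U+nFFFE and U+nFFFF).
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_scalar_value(cp: int) -> bool:
    # A Unicode scalar value is any code point except a surrogate.
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


def encodes_as_utf8(s: str) -> bool:
    # Python's strict UTF-8 encoder refuses lone surrogates.
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


noncharacter = "\uffff"    # a noncharacter, still a scalar value
lone_surrogate = "\ud800"  # an unpaired surrogate, not a scalar value

print(is_scalar_value(ord(noncharacter)), encodes_as_utf8(noncharacter))
print(is_scalar_value(ord(lone_surrogate)), encodes_as_utf8(lone_surrogate))
```

Under these assumptions the noncharacter passes both checks and the unpaired surrogate fails both, even though each can be stored in a Python str without complaint.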
Authoritative corrections are welcome to help solidify my understanding.
I don't wish to get involved in debates over JSON. I've read RFC 7159
and I know what it says.
Doug Ewell | http://ewellic.org | Thornton, CO