Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Sat May 9 07:51:18 CDT 2015

2015-05-09 14:16 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:

> On Saturday, 9 May 2015 at 06:24, Philippe Verdy wrote:
> > You are not stuck! You can still regenerate a valid JSON output encoded
> > in UTF-8: it will once again use escape sequences (which are also needed
> > if your text contains quotation marks used to delimit the JSON strings
> > in its syntax).
>
> That's a possible resolution, but a very bad one: in my program I can then
> no longer distinguish between the JSON strings "\uDEAD" and "\\uDEAD".
> This leads exactly to the interoperability problems mentioned in section
> 8.2 of RFC 7159.
>
> You say passing escapes to the programmer is needed if your text contains
> quotation marks; this is nonsense. A good and sane JSON codec will never
> let the programmer deal with escapes directly; it is its responsibility to
> allow the programmer to deal only with the JSON *data*, not the details of
> the encoding of the data.

Yes, this is part of the codec: the data itself is not modified and does
not have to deal with the syntax (quotation marks or escapes).
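
To make the point about the codec concrete, here is a minimal sketch in
Python. It assumes the standard json module with its default settings
(other codecs may reject lone surrogates outright), and the variable names
are only illustrative. It shows that the two JSON texts discussed above
decode to different data, and that escaping is reapplied on output without
the program ever touching the escapes itself.

```python
import json

# Two different JSON texts; raw strings show the exact characters on the wire.
escaped_surrogate = r'"\uDEAD"'    # one escape naming the lone surrogate U+DEAD
escaped_backslash = r'"\\uDEAD"'   # an escaped backslash followed by the letters uDEAD

a = json.loads(escaped_surrogate)  # data: a single code point, U+DEAD
b = json.loads(escaped_backslash)  # data: six characters: \ u D E A D

print(len(a), len(b))              # the decoded data differ in length
print(repr(a), repr(b))            # repr keeps the output ASCII-safe

# On encoding, the codec escapes again as needed; the data are not modified.
print(json.dumps(a))
print(json.dumps(b))
print(json.dumps('say "hi"'))      # quotation marks are escaped by the codec
```

As long as the decoded value is kept as data, the distinction survives a
round trip through the codec. It is lost only if the program stores the
escape sequence itself as text.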

> As such it will automatically unescape on decoding to give you the data
> represented by the encoding and automatically escape (if needed) the data
> you give it on encoding.
>
> > Unlike UTF-8, JSON has never been designed to restrict the values its
> > strings represent to plain text only; it is only a serialization of
> > "strings" to valid plain text using a custom syntax.
>
> You say a lot of things about what JSON is supposed to be/has been
> designed for. It would be nice to substantiate your claims by pointing at
> relevant standards. If JSON as in RFC 4627 really wanted to transmit
> sequences of bytes I think it would have been *much more* explicit.

No, instead it speaks (incorrectly) about code points and confuses that
concept with code units.

Code units are just code units, nothing else: they are not "characters",
certainly not in the sense of "Unicode abstract characters", and not even
"code points" or "scalar values". (And I did not speak about sequences of
"bytes", which are the result of the UTF-8 encoding if that is the charset
used to transport the plain-text JSON syntax.)
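
The difference between code units, code points and scalar values can be
sketched as follows, again in Python and again only as an illustration.
The helper names is_surrogate and lone_surrogates are hypothetical, not
part of any library. The sketch relies on the behaviour of CPython's json
module, which combines a valid high/low pair of \uXXXX escapes into one
code point and leaves an unpaired escape in the string as a lone surrogate.

```python
import json

def is_surrogate(cp):
    """True for code points U+D800..U+DFFF, which are not Unicode scalar values."""
    return 0xD800 <= cp <= 0xDFFF

def lone_surrogates(s):
    """Indices of surrogate code points left in a decoded Python str."""
    return [i for i, ch in enumerate(s) if is_surrogate(ord(ch))]

# Two escapes (two UTF-16 code units) that together denote U+1F600.
paired = json.loads(r'"\uD83D\uDE00"')
# One escape that is a code unit but denotes no scalar value.
lone = json.loads(r'"a\uDEADb"')

print(len(paired))                           # code points: 1
print(len(paired.encode("utf-16-le")) // 2)  # UTF-16 code units: 2
print(len(paired.encode("utf-8")))           # bytes in UTF-8: 4
print(lone_surrogates(paired), lone_surrogates(lone))

# A string holding a lone surrogate cannot be encoded as well-formed UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

A check like lone_surrogates is one way to detect, after decoding, that
some XXXX in a \uXXXX escape did not correspond to a Unicode character.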