Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
daniel.buenzli at erratique.ch
Sat May 9 07:16:28 CDT 2015
On Saturday, 9 May 2015 at 06:24, Philippe Verdy wrote:
> You are not stuck! You can still regenerate a valid JSON output encoded in UTF-8: it will once again use escape sequences (which are also needed if your text contains quotation marks used to delimit the JSON strings in its syntax).
That's a possible resolution, but a very bad one: my program can then no longer distinguish between the JSON strings "\uDEAD" and "\\uDEAD". This leads to exactly the interoperability problems mentioned in section 8.2 of RFC 7159.
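To make the collision concrete, here is a small sketch in Python (whose json module happens to tolerate lone surrogates in its str type, so both payloads can be decoded and compared):

```python
import json

# A lone escaped surrogate and a literal backslash sequence are
# different JSON payloads and decode to different strings...
lone = json.loads(r'"\uDEAD"')      # one lone surrogate code point
literal = json.loads(r'"\\uDEAD"')  # six characters: \ u D E A D
assert lone != literal

# ...but "repairing" the lone surrogate on output by re-escaping it
# as the literal six characters \uDEAD collapses the two: the result
# is indistinguishable from the string that really contained them.
repaired = lone.replace('\udead', r'\uDEAD')
assert repaired == literal
```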
You say passing escapes to the programmer is needed if your text contains quotation marks; this is nonsense. A good, sane JSON codec will never let the programmer deal with escapes directly: it is its responsibility to let the programmer deal only with the JSON *data*, not with the details of how that data is encoded. As such it will automatically unescape on decoding, giving you the data represented by the encoding, and automatically escape (if needed) the data you give it on encoding.
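This is exactly how existing codecs behave; for instance Python's json module round-trips quotation marks, backslashes and control characters without the programmer ever seeing an escape:

```python
import json

# The programmer hands the codec raw data; escaping is the codec's job.
data = 'quote: " backslash: \\ newline:\n'
encoded = json.dumps(data)   # codec inserts \" \\ \n as needed
decoded = json.loads(encoded)
assert decoded == data       # decoding undoes it transparently
```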
> Unlike UTF-8, JSON has never been designed to restrict its strings to have its represented values to be only plain-text, it is only a serialization of "strings" to valid plain-text using a custom syntax.
You say a lot of things about what JSON is supposed to be/has been designed for. It would be nice to substantiate your claims by pointing at relevant standards. If JSON as in RFC 4627 really wanted to transmit sequences of bytes I think it would have been *much more* explicit.
The introductions of both RFC 4627 (remember, written by the *inventor* of JSON) and RFC 7159 (which obsoletes 4627) say "A string is a sequence of zero or more Unicode characters". As we already mentioned, we both agree this is very imprecise. There are two interpretations:
* It is a sequence of Unicode scalar values, i.e. text (mine)
* It is a sequence of Unicode code points, possibly including lone surrogates (yours)
Now given this imprecision, you cannot ignore the fact that some stupid people that are very wrong, like me, will take the first interpretation. Since this interpretation is less liberal, you will have to cope with it and acknowledge that lone escaped surrogates may not be interpreted correctly in the wild.
This leads to the clarification and the interoperability warnings of section 8.2 in RFC 7159. If you read these two paragraphs carefully, you may infer that their "Unicode character" is more likely to mean "Unicode scalar value". These paragraphs were not present in RFC 4627, so the latter was really ambiguous. I would however say RFC 7159 is not; if you don't agree with that, we are still left with the two possible interpretations above, and if you care about interoperability you should know which interpretation to take.
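Coming back to the subject line, one way to detect the problem is to decode the escapes and then check whether the result is still text, i.e. encodable as UTF-8. A sketch in Python (which decodes \uXXXX escapes, joins valid surrogate pairs, but leaves lone surrogates in the str):

```python
import json

def has_lone_surrogates(s):
    """True if s contains code points that are not Unicode scalar values."""
    try:
        s.encode('utf-8')  # UTF-8 encoding rejects lone surrogates
    except UnicodeEncodeError:
        return True
    return False

bad = json.loads(r'"\uDEAD"')        # lone surrogate: not text
good = json.loads(r'"\uD834\uDD1E"') # valid pair, decodes to U+1D11E
assert has_lone_surrogates(bad)
assert not has_lone_surrogates(good)
```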