Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Daniel Bünzli daniel.buenzli at erratique.ch
Thu May 7 15:29:27 CDT 2015

On Thursday, 7 May 2015 at 21:59, Markus Scherer wrote:
> I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters.  

> Some code stores binary data (sequence of arbitrary 16-bit unsigned integers) in a "string", just because it is easy and fairly efficient to transport.
> You should "validate" *text* only when you are certain that it is indeed text.
Section 8.2 [1] of the spec says specifically that only strings representing sequences of Unicode scalar values (the spec calls them "characters") are interoperable, and that strings which do not represent such sequences, like "\uDEAD", can lead to unpredictable behaviour.
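
To make the distinction concrete, here is a minimal sketch in Python (my choice of language, nothing in the spec prescribes it). It assumes a parser that, like Python's standard json module, accepts "\uDEAD" and hands back the lone surrogate unchanged while combining well-formed surrogate pairs into one code point. The helper name is_scalar_value_sequence is hypothetical.

```python
import json


def is_scalar_value_sequence(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value."""
    # Surrogates (U+D800..U+DFFF) are code points but not scalar values.
    return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)


# Raw strings keep the backslashes, so the JSON parser sees the \uXXXX escapes.
lone = json.loads(r'"\uDEAD"')        # parses, but yields an unpaired surrogate
pair = json.loads(r'"\uD83D\uDE00"')  # well-formed pair, decoded to U+1F600

print(is_scalar_value_sequence(lone))
print(is_scalar_value_sequence(pair))
```

The check runs after parsing, on the decoded string. A parser that rejects or replaces unpaired surrogates by itself would make it unnecessary.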

If you want to transmit binary data reliably in JSON, you must apply some form of binary to Unicode scalar value encoding, as in most text-based interchange formats.
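
As an illustration, the sketch below uses Base64 as the binary-to-text encoding. Base64 is my assumption here (RFC 7159 does not mandate any particular encoding), and the member name "data" is hypothetical. The encoded string consists only of ASCII characters, so it is a sequence of Unicode scalar values and falls within the interoperable subset.

```python
import base64
import json

# Arbitrary bytes that are not text in any encoding.
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Sender: map the bytes to ASCII characters, then embed them in JSON.
encoded = base64.b64encode(payload).decode("ascii")
doc = json.dumps({"data": encoded})

# Receiver: parse the JSON text, then reverse the Base64 step.
decoded = base64.b64decode(json.loads(doc)["data"])

print(doc)
print(decoded == payload)
```

The price is size: Base64 emits four characters for every three input bytes.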



[1] https://tools.ietf.org/html/rfc7159#section-8.2
