Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Daniel Bünzli daniel.buenzli at erratique.ch
Fri May 8 20:27:20 CDT 2015

On Saturday, 9 May 2015 at 02:33, Philippe Verdy wrote:
> 2015-05-08 14:32 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:
> > Well, did you test them all? There's quite a big list at http://www.json.org. Taking a random one mentioned on that page leads me to http://golang.org/pkg/encoding/json/, where they say that they replace invalid UTF-16 surrogate pairs with U+FFFD. This is really not very surprising, since Go's strings as text are apparently UTF-8 encoded, so when you need to produce your results as UTF-8 you don't have a lot of solutions: error and/or U+FFFD.
> I've already said that JSON is UTF-8 encoded by default, but this does not mean that JSON invalidates the escape sequence '\uD800' isolated in a string.

You didn't get what I said. When a parser wants to hand a JSON string it has just parsed back to the programmer as a native string of the language, and those native strings happen to be UTF-8 encoded, then in the presence of such lone surrogates you are stuck and need to do something, because you cannot encode them in a UTF-8 string.
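
To make this concrete, here is a minimal sketch in Python (not one of the parsers discussed above; I use it only because its json module happens to keep lone surrogates in the decoded string). The helper name to_utf8 is made up for the illustration. It shows that the strict UTF-8 encoder refuses the lone surrogate, and that substituting U+FFFD is one way out, at the cost of losing the original code unit.

```python
import json

def to_utf8(s: str) -> bytes:
    """Encode s as UTF-8, replacing each surrogate code point with U+FFFD.

    Hypothetical helper for illustration. After json.loads, valid
    surrogate pairs have already been combined into one code point,
    so any surrogate still present in the string is a lone one.
    """
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in s
    )
    return cleaned.encode("utf-8")

# JSON text: "a\ud800b" (an isolated high surrogate between two letters).
parsed = json.loads('"a\\ud800b"')

# The strict UTF-8 encoder cannot represent the lone surrogate.
try:
    parsed.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Replacement strategy: the surrogate becomes U+FFFD (bytes EF BF BD).
replaced = to_utf8(parsed)
```

The other option is simply to report an error to the caller instead of replacing; which one is appropriate depends on whether the application can tolerate the loss.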

(I understand that in *your* interpretation this should not happen, since I should define a special data type to represent these JSON strings so that they behave like JavaScript strings; that would indeed be very practical, since none of my language's native string tools could be used on it…)
Anyways, we are largely OT at this point.  


