Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Fri May 8 23:24:36 CDT 2015

2015-05-09 3:27 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:

> On Saturday, 9 May 2015 at 02:33, Philippe Verdy wrote:
> > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:
> > > Well did you test them all ? There's quite a big list here
> > > http://www.json.org. Taking a random one mentioned on that page leads me
> > > to http://golang.org/pkg/encoding/json/ in which they say that they
> > > replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very
> > > surprising since apparently go's strings as text are UTF-8 encoded so when
> > > you need to produce your results as UTF-8 then you don't have a lot of
> > > solutions... error and/or U+FFFD.
> >
> >
> > I've already said that JSON is UTF-8 encoded by default, but this does
> > not mean that JSON invalidates the escape sequence '\uD800' isolated in a
> > string.
> You didn't get what I said. When a parser returns a JSON string it just
> parsed and that it wants to give it back to the programmer using the native
> string of the language and that these strings happen to be UTF-8 encoded in
> this language, then in presence of such lone surrogates you are stuck and
> need to do something as you cannot encode them in the UTF-8 string.

You are not stuck! You can still regenerate a valid JSON output encoded in
UTF-8: it will once again use escape sequences (which are also needed if
your text contains quotation marks used to delimit the JSON strings in its

Unlike UTF-8, JSON was never designed to restrict the values its strings
represent to plain text only; it is only a serialization of "strings" to
valid plain text using a custom syntax.

There's absolutely no need to restrict string values to the same
validation rules and the same subset as the set of acceptable plain text:
these are not the same layer. One is the string level (in fact not bound to
any character encoding and not restricted to text), the other is the
plain text, and JSON is the adapter/converter between these two
representations. Do not mix these two distinct layers.

(this is also the case when someone confuses an XML document with its DOM:
not the same layer)
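
To illustrate the two layers, here is a minimal sketch using Python's
standard json module. It assumes Python 3, whose native strings are not
UTF-8 encoded internally and so can hold a lone surrogate; a language with
UTF-8 native strings would need some other representation for the parsed
value. The variable names are only illustrative.

```python
import json

# Plain-text layer: a JSON text that is pure ASCII (hence valid UTF-8),
# but whose string value is the isolated escape sequence \uD800.
text = '"\\ud800"'

# String layer: the parsed value is a single code unit, U+D800.
value = json.loads(text)
assert len(value) == 1 and ord(value) == 0xD800

# The value itself cannot be encoded directly as UTF-8 ...
try:
    value.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ... but it can be serialized to JSON again: with the default
# ensure_ascii=True, json.dumps writes it back as an escape sequence,
# and the resulting JSON text encodes to UTF-8 without error.
output = json.dumps(value)
output_bytes = output.encode("utf-8")
```

In this sketch the round trip never puts the lone surrogate into the UTF-8
byte stream; only its escape sequence appears there.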