Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Fri May 8 19:33:20 CDT 2015


2015-05-08 14:32 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:

> On Friday, 8 May 2015 at 13:48, Philippe Verdy wrote:
> > JSON came initially from Javascript, and it is used extensively with
> Javascript.
>
> But not *only* for a long time now.
>
> > The RFC is deviating from the currently running implementations.
>
> Well did you test them all ? There's quite a big list here
> http://www.json.org. Taking a random one mentioned on that page leads me
> to http://golang.org/pkg/encoding/json/ in which they say that they
> replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very
> surprising since apparently go's strings as text are UTF-8 encoded so when
> you need to produce your results as UTF-8 then you don't have a lot of
> solutions... error and/or U+FFFD.
>

I've already said that JSON is UTF-8 encoded by default, but this does not
mean that JSON invalidates the escape sequence '\uD800' isolated in a
string.
For this reason JSON strings are not restricted by the textual encoding of
their syntactic representation.

So no error returned, no replacement by U+FFFD and even unpaired surrogates
are possible, provided that they are escaped.
Basically JSON strings remain equivalent to Javascript strings where
'\uD800' is also a perfectly valid "string".
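
A minimal sketch of this behavior, assuming Python's standard json module
as the example implementation (chosen only for illustration; it is not one
of the implementations discussed in this thread):

```python
import json

# JSON text containing an escaped, unpaired high surrogate.
json_text = '"\\ud800"'

# The parser neither raises an error nor substitutes U+FFFD.
value = json.loads(json_text)
print(len(value))        # 1
print(hex(ord(value)))   # 0xd800

# Serializing escapes the surrogate again (ensure_ascii defaults to True),
# so the value round-trips through the JSON text unchanged.
round_tripped = json.dumps(value)
print(round_tripped == json_text)   # True
```

Other implementations may behave differently, as the Go package quoted
above does.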

I make a distinction between a "string" and plain text.

And if the RFC had not been so confusing in mixing terms (notably the term
"code point"), it might have become a standard. For now it is just a
tentative attempt to standardize JSON, and it does not match existing
implementations, which started from the beginning as a data serialization
format based on Javascript syntax. Only the items that are not pure data
were removed: functions/methods, more complex objects like Javascript
regexp literals (functionally equivalent to an object constructor), object
references... What was kept is strings, numbers, and only two structures:
ordered arrays and unordered associative arrays (also called dictionaries).
The latter also cover ordered arrays, considered as associative arrays with
number keys, which reduces the model to only one effective structure, even
if ordered arrays also have simpler syntactic sugar to represent them in a
more compact way.

If you mean that the JSON string "\uD800" is invalid, then JSON is no longer
a data serialization for Javascript, or for other languages also using JSON
as a possible syntax for serializing data into plain text. JSON was created
because XML (the alternative) was too verbose and had restrictions in its
"text" elements. It seems that the RFC just wants to apply to JSON the same
restrictions as found in XML, but this diverts JSON from its objective, and
I'm convinced that such restrictions are not enforced at all in many JSON
implementations, which do not attempt to validate whether the value of the
represented string is valid plain text. JSON only transforms strings
into a valid plain-text representation, using an encoding syntax made of
separators and escape sequences, nothing else.

If the RFC wants to add such restrictions, it is mixing two layers: the
syntactic (plain-text) layer and the lower layer for the internally
represented values, which are just a stream of code units.
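
The two layers can be illustrated with a short sketch, again assuming
Python's standard json module and its default ensure_ascii=True setting
(an illustrative choice, not something the RFC prescribes):

```python
import json

value = "\ud800"  # value layer: a string holding one unpaired surrogate

# Syntactic layer: the serialized JSON text is pure ASCII, so it is
# valid plain text and encodes to UTF-8 without any problem.
json_text = json.dumps(value)
encoded = json_text.encode("utf-8")

# Value layer: the string itself is not valid plain text, so encoding
# it directly as UTF-8 fails.
try:
    value.encode("utf-8")
    value_is_plain_text = True
except UnicodeEncodeError:
    value_is_plain_text = False

print(encoded)              # the JSON text as UTF-8 bytes
print(value_is_plain_text)  # False
```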

And the only difference in that case is the behavior for isolated/unpaired
surrogates: they are not restricted in Javascript or in many other
languages defining "strings", but they are restricted in plain text, and
JSON is there to offer the serialization scheme allowing strings to be
safely converted to plain text.
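
For an application that does need to know whether a decoded string is
valid plain text, the check can be done after parsing rather than in the
JSON layer. A minimal sketch, assuming Python (whose strings are sequences
of code points) and a hypothetical helper named lone_surrogates; in a
language with UTF-16 code unit strings the check would instead have to
look at neighbouring code units:

```python
import json

def lone_surrogates(text):
    """Return the positions of unpaired surrogates in a decoded string."""
    # json.loads combines a valid high/low escape pair into one code point,
    # so any character still in U+D800..U+DFFF is unpaired.
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]

paired = json.loads('"\\ud83d\\ude00"')   # valid pair, decodes to U+1F600
unpaired = json.loads('"a\\ud800b"')      # isolated high surrogate

print(lone_surrogates(paired))    # []
print(lone_surrogates(unpaired))  # [1]
```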