Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Thu May 7 22:08:21 CDT 2015

The RFC is just informative, not normative, and actual usage and
implementations simply treat JSON strings as plain streams of 16-bit code
units, even if the transport syntax requires encoding the text in some UTF
(not necessarily UTF-8, even if this is the default).
Try it yourself: you can perfectly well send JSON text containing '\uFFFF'
(a noncharacter) or '\uD800' (an unpaired surrogate), and I have not seen
any JSON implementation complain about either one. When you receive the
JSON stream and use it in Javascript, you will see no missing or replaced
code units, and no exception either.
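
A minimal sketch of that experiment, using Python's standard json module
rather than Javascript (an assumption made purely for illustration; other
parsers may behave differently). The helper find_suspect_code_points is a
hypothetical name, not part of any library. It scans a decoded string for
unpaired surrogates and noncharacters, which is one way to detect after the
fact that a \uXXXX escape did not denote an ordinary Unicode character.

```python
import json


def find_suspect_code_points(text):
    """Return (index, code point, reason) for each unpaired surrogate
    or noncharacter found in an already-decoded string.

    Hypothetical helper for illustration only.
    """
    suspects = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Python's json decoder merges valid surrogate pairs into one
            # code point, so any surrogate left here is unpaired.
            suspects.append((index, cp, "unpaired surrogate"))
        elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            # U+FDD0..U+FDEF plus the last two code points of every plane.
            suspects.append((index, cp, "noncharacter"))
    return suspects


# The parser accepts both escapes without raising an exception.
decoded = json.loads('"a\\uFFFF\\uD800b"')

for index, cp, reason in find_suspect_code_points(decoded):
    print(f"index {index}: U+{cp:04X} ({reason})")

# The problem only surfaces when the string has to leave the process
# in a real UTF: the lone surrogate cannot be encoded as UTF-8.
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode as UTF-8:", exc.reason)
```

With this parser, the noncharacter passes through both decoding and UTF-8
encoding untouched; only the unpaired surrogate causes an error, and only
at encoding time.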

2015-05-08 3:22 GMT+02:00 Daniel Bünzli <daniel.buenzli at erratique.ch>:

> On Friday, 8 May 2015 at 02:16, Philippe Verdy wrote:
> > It would be more exact to say that JSON strings, just like strings in
> > Javascript and Java or many programming languages are just binary streams
> > of 16-bit code units.
> I suggest you have a careful read at RFC 7159 as it specifically implies
> that this is not the model it supports (albeit using broken or let's say
> ambiguous/imprecise Unicode terminology).
> > Then the JSON processor will decode this text and will remap it to an
> > internal UTF-16 encoding (for characters that are not escaped) and the
> > "\uXXXX" will be decoded as plain 16-bit code units. The result will be a
> > stream of 16-bit code units, which can then externally be output and
> > encoded or stored in any convenient encoding that preserves this stream,
> > EVEN if this is not valid UTF-16.
> I don't know where you get this from but you won't find any mention of
> this in the standard. We are dealing with text, Unicode scalar values, not
> encodings. At the risk of repeating myself, read section 8.2 of RFC 7159.
> Best,
> Daniel