Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Sat May 9 08:07:12 CDT 2015

2015-05-09 14:51 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> You say a lot of things about what JSON is supposed to be/has been
>> designed for. It would be nice to substantiate your claims by pointing at
>> relevant standards. If JSON as in RFC 4627 really wanted to transmit
>> sequences of bytes I think it would have been *much more* explicit.
> No instead it speaks (incorrectly) about code points and mixes the concept
> with code units.

In fact it mixes/confuses three separate concepts, i.e. three distinct
layers (that the Unicode standard distinguishes clearly):
1.  the internal dataset (values of "strings" as expected by programmers
and transmitted via the CODEC of the JSON parser/encoder), using code units
of a fixed size (16-bit);
2.  the plain-text syntax of JSON (which is independent of the actual
character encoding but can be formalized as a stream of Unicode code points);
3.  the serialization of this plain text into a stream of bytes (using some
UTF encoding scheme, or other legacy 8-bit charsets).
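
The three layers can be made visible with a small Python sketch. It assumes
CPython's standard json module, which happens to accept a lone surrogate
escape; other parsers may reject it. The helper name encodes_as_utf8 is
hypothetical and only serves the illustration.

```python
import json

# Layer 2: the JSON plain text. The escape is six ordinary ASCII characters
# between two quotes, so the text itself contains no surrogate at all.
json_text = '"\\ud800"'

# Layer 1: the decoded value handed to the programmer. CPython's json module
# keeps the lone surrogate as a single element of the string.
value = json.loads(json_text)
code_points = [ord(c) for c in value]


def encodes_as_utf8(s):
    """Layer 3 check (hypothetical helper): can s be serialized as UTF-8?"""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The JSON text serializes fine; the decoded value does not, because a lone
# surrogate is not a Unicode scalar value.
print(code_points, encodes_as_utf8(json_text), encodes_as_utf8(value))
```

So one way to detect that XXXX does not correspond to a character is to test
the decoded value (layer 1) against a strict UTF encoder (layer 3).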

The initial implementation of JSON, in Javascript, still used today, just
performs the adaptation of the internal dataset (16-bit streams) to
plain-text (layers 1. and 2. above).
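
As a rough sketch of that adaptation between layers 1 and 2, the Python
below treats a string as a plain list of 16-bit code units, writes each unit
as a \uXXXX escape, and separately reports the units that are unpaired
surrogates. Both function names are hypothetical, and escaping every unit is
a simplification: a real encoder leaves most ASCII unescaped.

```python
def unpaired_surrogates(units):
    """Return the indices of 16-bit code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the well-formed pair
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate not preceded by a high surrogate.
            bad.append(i)
        i += 1
    return bad


def escape_units(units):
    """Write every code unit as a JSON \\uXXXX escape (layer 1 to layer 2)."""
    return '"' + "".join("\\u%04x" % u for u in units) + '"'


units = [0x0041, 0xD800]
print(escape_units(units), unpaired_surrogates(units))
```

Note that the escaping step never fails: any 16-bit value can be written as
\uXXXX, which is why the check has to be a separate, explicit pass.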

Then Javascript itself specifies no serialization of its source: this is
part of the MIME standard for the transport (using the MIME "charset"
attribute of the media type) when using protocols like HTTP or HTTPS, or of
some external metadata, or of a static definition which is system-dependent
(for example in local file systems that do not store the metadata as a file
attribute, a case for which the "BOM" or similar signatures were created, or
for which there is specific syntax in some languages like XML or HTML for
specifying the charset at the beginning of the file, or which is handled by
using some "charset guesser").

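A minimal sketch of the "BOM" case mentioned above, in Python: it looks only
at a leading signature and returns no answer when there is none, at which
point the charset has to come from the transport or from a guesser. The
function name sniff_bom and the table _BOMS are hypothetical; the BOM
constants come from the standard codecs module.

```python
import codecs

# UTF-32 signatures are listed first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(codecs.BOM_UTF8 + b"{}"))
```
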
Here also Javascript programmers do not have to worry about layers 2.
and 3. above: they just have to handle 16-bit streams (same remark in PHP,
Java or many programming languages). They work at layer 1, where there is
a single encoding, a single size of code unit for everything, and no
restriction of values on code units. Same thing when working with the DOM