Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Philippe Verdy verdy_p at wanadoo.fr
Sat May 9 13:44:32 CDT 2015


It is not necessary; in fact, that same section also repeats that any
"code point" from U+0000 to U+FFFF is representable with the escape
sequence, without restriction!
This just confirms that JSON does not really encode Unicode strings but
just streams of arbitrary 16-bit code units (which are then possibly
re-encoded into an internal encoding scheme used by JSON parsers, that
internal encoding being bound to the programming environment and its
internal binary API or exposed variables or properties).
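You can see this directly with the standard JSON.parse built into any
JavaScript/TypeScript engine (a minimal illustration, nothing beyond the
standard API is assumed):

    // JSON.parse accepts an escape for a lone surrogate, which is not a
    // Unicode scalar value -- evidence that JSON strings are really
    // sequences of arbitrary 16-bit code units:
    const s: string = JSON.parse('"\\uD800"');  // no error is thrown
    console.log(s.length);                      // 1 (one 16-bit code unit)
    console.log(s.charCodeAt(0).toString(16));  // "d800", an unpaired surrogate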

The fact that it is also bound to the plain-text encoding is just because
the plain-text characters used in its syntax that are not encoded with
those escape sequences, and that are not assigned a special role such as
delimiting string literals, will be decoded from the input syntax and then
re-encoded into their equivalent in the internal encoding (in the parser,
or exposed by the parser in its returned variables or properties):
- if the transport format is UTF-8, the source file will be read using a
UTF-8 scanner returning code points, or small strings containing the full
sequence representing a single code point (over MIME-compatible transports
this uses the charset settings of that transport). These code points are
then converted to one or two 16-bit code units. Then the JSON syntax is
recognized by the parser, which recognizes string delimiters, and then
also the escape sequences, which are parsed and likewise converted to
16-bit code units.
This internal stream of 16-bit code units is then exposed to the output
using the encoding expected by the JSON client or programming environment.
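The code-point-to-code-unit conversion step could be sketched like this
(toCodeUnits is a hypothetical helper written for illustration, not part
of any parser's API):

    // Hypothetical helper: convert one Unicode code point to the one or
    // two 16-bit code units a JSON parser would hold internally.
    function toCodeUnits(cp: number): number[] {
      if (cp <= 0xFFFF) return [cp];   // BMP: a single 16-bit code unit
      const v = cp - 0x10000;          // 20-bit value, supplementary planes
      return [
        0xD800 + (v >> 10),            // high (lead) surrogate
        0xDC00 + (v & 0x3FF),          // low (trail) surrogate
      ];
    }
    toCodeUnits(0x41);     // [0x0041]
    toCodeUnits(0x1F600);  // [0xD83D, 0xDE00], i.e. \uD83D\uDE00 in JSON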

In summary, the reference to Unicode in the RFCs for JSON is not really
necessary; all they need to say is that JSON parsers must be able to
accept a file containing any plain text valid in its transport encoding
scheme, and that they will be able to decode from it the stream of 16-bit
code units and generate valid output in the encoding expected by the
client (when the client is JavaScript or Java, the internal encoding will
be the same as the exposed encoding; this won't be true in Lua, PHP, or
many C/C++ programs, which often prefer 8-bit strings; some languages are
hybrids and support two kinds of strings: 8-bit strings and 16-bit
strings, rarely 32-bit strings).
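And to come back to Roger's original question below: once the escapes are
decoded into that stream of 16-bit code units, the only \uXXXX values that
cannot correspond to any Unicode character are surrogates that do not form
a valid pair. A scanner for them could be sketched like this
(findUnpairedSurrogates is a hypothetical helper, not an existing tool):

    // Scan a decoded string for unpaired surrogate code units and
    // return their indices.
    function findUnpairedSurrogates(s: string): number[] {
      const positions: number[] = [];
      for (let i = 0; i < s.length; i++) {
        const cu = s.charCodeAt(i);
        if (cu >= 0xD800 && cu <= 0xDBFF) {       // high surrogate
          const next = s.charCodeAt(i + 1);       // NaN at end of string
          if (next >= 0xDC00 && next <= 0xDFFF) {
            i++;                                  // valid pair, skip the low half
          } else {
            positions.push(i);                    // unpaired high surrogate
          }
        } else if (cu >= 0xDC00 && cu <= 0xDFFF) {
          positions.push(i);                      // unpaired low surrogate
        }
      }
      return positions;
    }
    // [1]: the lone \uD800 is flagged, the valid pair \uD83D\uDE00 is not.
    findUnpairedSurrogates(JSON.parse('"a\\uD800b\\uD83D\\uDE00"'));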

2015-05-09 8:26 GMT+02:00 Norbert Lindenberg <unicode at lindenbergsoftware.com>:

> RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode
> code points in the Basic Multilingual Plane, but also a 12-character
> sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800
> ≤ YYYY < 0xDC00 ≤ ZZZZ ≤ 0xDFFF) for supplementary Unicode code points. A
> tool checking for escape sequences that don’t correspond to any Unicode
> character must be aware of this, because neither \uYYYY nor \uZZZZ by
> itself would correspond to any Unicode character, but their combination may
> well do so.
>
> Norbert
>
> [1] https://tools.ietf.org/html/rfc7158#section-7
>
>
> > On May 7, 2015, at 5:46 , Costello, Roger L. <costello at mitre.org> wrote:
> >
> > Hi Folks,
> >
> > The JSON specification says that a character may be escaped using this
> notation: \uXXXX    (XXXX are four hex digits)
> >
> > However, not every four hex digits corresponds to a Unicode character.
> >
> > Are there tools to scan a JSON document to detect the presence of
> \uXXXX, where XXXX does not correspond to any Unicode character?
> >
> > /Roger
> >
>
>
>