Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
Daniel Bünzli
daniel.buenzli at erratique.ch
Thu May 7 14:35:00 CDT 2015
On Thursday, 7 May 2015 at 14:46, Costello, Roger L. wrote:
> The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits)
>
> However, not every four hex digits corresponds to a Unicode character.
RFC 7159 uses imprecise terminology here. By "character" it means "any code point in U+0000 to U+FFFF" (since you need escaped surrogate pairs to be able to escape scalar values outside the BMP). You can see its definition of a "character that may be escaped" in this sentence of section 7 [1]:
"Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF) then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point."
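To make the surrogate-pair point concrete, here is a minimal Python sketch of how a scalar value could be turned into its JSON escape. The function name json_escape is hypothetical and is not part of RFC 7159 or of any library; the arithmetic is the standard UTF-16 surrogate computation.

```python
def json_escape(cp: int) -> str:
    """Return a JSON \\u escape for the Unicode scalar value cp (hypothetical helper)."""
    # Scalar values are all code points except the surrogates U+D800..U+DFFF.
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    if cp <= 0xFFFF:
        # In the BMP: a single six-character escape.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return f"\\u{high:04X}\\u{low:04X}"
```

For instance, U+1D11E (the G clef used as an example in RFC 7159) lies outside the BMP, so it needs two escapes, each of which is a surrogate code point and not a character on its own.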
However, if you are concerned about invalid surrogate sequences or lone surrogates (about which the standard sadly has nothing to say [2]), I have written a best-effort JSON parser [3] that reports them and lets you continue by replacing the offending escape sequences with U+FFFD. The distribution includes a command-line test tool named jsontrip that can, among other things, report these errors. For example:
> echo '["\uDEAD"]' | jsontrip
-:1.2-1.8: illegal escape, U+DEAD lone low surrogate
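If you only need the check and not a full parser, the idea can be sketched in a few lines of Python. This is an illustration under assumptions, not how jsonm is implemented: the function name find_lone_surrogates is hypothetical, the input is assumed to be otherwise well-formed JSON text (so backslashes only occur inside strings), and positions are reported as character offsets, not line and column ranges.

```python
import re

# A backslash followed by either uXXXX (group 1 = the hex digits) or any other
# single character. Matching the generic case keeps "\\uDEAD" (an escaped
# backslash followed by plain text) from being misread as an escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)

def find_lone_surrogates(text):
    """Return (repaired_text, errors); lone surrogate escapes become \\uFFFD."""
    out, errors, pos = [], [], 0
    while True:
        m = _ESCAPE.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        out.append(text[pos:m.start()])
        piece, end = m.group(0), m.end()
        if m.group(1) is not None:
            cp = int(m.group(1), 16)
            if 0xD800 <= cp <= 0xDBFF:
                # A high surrogate is valid only if a low surrogate escape
                # follows immediately.
                nxt = _ESCAPE.match(text, end)
                if nxt and nxt.group(1) and 0xDC00 <= int(nxt.group(1), 16) <= 0xDFFF:
                    piece, end = text[m.start():nxt.end()], nxt.end()
                else:
                    errors.append((m.start(), f"U+{cp:04X} lone high surrogate"))
                    piece = "\\uFFFD"
            elif 0xDC00 <= cp <= 0xDFFF:
                # A low surrogate that was not consumed as part of a pair.
                errors.append((m.start(), f"U+{cp:04X} lone low surrogate"))
                piece = "\\uFFFD"
        out.append(piece)
        pos = end
    return "".join(out), errors
```

Called on the same input as above, find_lone_surrogates('["\\uDEAD"]') returns the repaired text ["\uFFFD"] together with one error at offset 2 for U+DEAD, while a well-formed pair such as \uD834\uDD1E passes through unchanged.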
Best,
Daniel
[1] https://tools.ietf.org/html/rfc7159#section-7
[2] https://tools.ietf.org/html/rfc7159#section-8.2
[3] http://erratique.ch/software/jsonm