Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Norbert Lindenberg unicode at lindenbergsoftware.com
Sat May 9 01:26:56 CDT 2015


RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode code points in the Basic Multilingual Plane, but also a 12-character sequence encoding a UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800 ≤ YYYY < 0xDC00 ≤ ZZZZ ≤ 0xDFFF) for supplementary Unicode code points. A tool checking for escape sequences that don’t correspond to any Unicode character must be aware of this: neither \uYYYY nor \uZZZZ by itself corresponds to a Unicode character, but the two together may well do so.
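
To illustrate the pairing rule, here is a minimal Python sketch of such a check. It is not an existing tool: the names ESCAPE and find_lone_surrogates are made up for this example, and it assumes the input is otherwise well-formed JSON text, so that backslashes occur only inside strings. It reports only unpaired surrogate escapes and does not try to detect unassigned code points.

```python
import re

# Match one backslash escape at a time. Consuming "\\" as a unit means
# that an escaped backslash followed by "uXXXX" is not mistaken for a
# \uXXXX escape.
ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|.)', re.DOTALL)


def find_lone_surrogates(json_text):
    """Return (offset, escape) for each \\uXXXX that is an unpaired surrogate."""
    # Collect (start, end, value) for every \uXXXX escape, in document order.
    escapes = [(m.start(), m.end(), int(m.group(1), 16))
               for m in ESCAPE.finditer(json_text) if m.group(1)]
    problems = []
    i = 0
    while i < len(escapes):
        start, end, value = escapes[i]
        if 0xD800 <= value <= 0xDBFF:
            # High surrogate: valid only if a low surrogate escape
            # follows immediately, with no characters in between.
            nxt = escapes[i + 1] if i + 1 < len(escapes) else None
            if nxt and nxt[0] == end and 0xDC00 <= nxt[2] <= 0xDFFF:
                i += 2  # well-formed pair, skip both halves
                continue
            problems.append((start, json_text[start:end]))
        elif 0xDC00 <= value <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            problems.append((start, json_text[start:end]))
        i += 1
    return problems


if __name__ == "__main__":
    print(find_lone_surrogates(r'{"s": "\uD83D\uDE00"}'))  # well-formed pair
    print(find_lone_surrogates(r'{"s": "\uD83D"}'))        # lone high surrogate
```

The scan works on the raw JSON text rather than on parsed strings, because many parsers either reject lone surrogates outright or pass them through silently, and in both cases the position of the offending escape is lost.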

Norbert

[1] https://tools.ietf.org/html/rfc7158#section-7


> On May 7, 2015, at 5:46 , Costello, Roger L. <costello at mitre.org> wrote:
> 
> Hi Folks,
> 
> The JSON specification says that a character may be escaped using this notation: \uXXXX    (XXXX are four hex digits)
> 
> However, not every sequence of four hex digits corresponds to a Unicode character.
> 
> Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character?
> 
> /Roger
> 



