Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Jeff Senn senn at maya.com
Thu May 7 14:23:49 CDT 2015


While this may not change the OP's need for such a tool, I read the JSON specification as allowing every \uXXXX escape from 0x0000 to 0xffff, regardless of whether it maps to a "valid" Unicode character.  The allowed use of escaped UTF-16 surrogate pairs for characters with code points > 0xffff (without also specifying that unpaired surrogates are invalid) is troubling on the margin, and complicates such a validation.

Another complication is that a "JSON document" might itself be non-ASCII (UTF-8, UTF-16 or UTF-32) and have Unicode characters as literals within quoted strings...

Not to mention the ambiguous case of a surrogate pair where one half is a literal character and the other half is a \uXXXX escape...
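
To make the surrogate bookkeeping concrete, here is a minimal sketch in plain Java (no ICU). The class and method names are hypothetical, made up for illustration. It reports an escaped surrogate as unpaired unless its partner is also an escape immediately next to it, so a half-literal, half-escaped pair gets flagged; that is a policy assumption of this sketch, not something the JSON specification says.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
class SurrogateEscapeCheck {

    /** Offsets of backslash-u escapes in json that encode an unpaired surrogate. */
    static List<Integer> unpairedSurrogateEscapes(String json) {
        List<Integer> bad = new ArrayList<>();
        int pendingHigh = -1; // offset of an escaped high surrogate still waiting for its low half
        int i = 0;
        while (i < json.length()) {
            boolean isEscape = json.charAt(i) == '\\' && i + 1 < json.length()
                    && json.charAt(i + 1) == 'u' && isHex4(json, i + 2);
            char unit = isEscape ? (char) Integer.parseInt(json.substring(i + 2, i + 6), 16) : 0;
            if (isEscape && Character.isLowSurrogate(unit) && pendingHigh >= 0) {
                pendingHigh = -1;              // escaped high + escaped low: a complete pair
            } else {
                if (pendingHigh >= 0) {        // anything else leaves the earlier high half unpaired
                    bad.add(pendingHigh);
                    pendingHigh = -1;
                }
                if (isEscape && Character.isHighSurrogate(unit)) {
                    pendingHigh = i;
                } else if (isEscape && Character.isLowSurrogate(unit)) {
                    bad.add(i);                // low half with no escaped high half before it
                }
            }
            // Advance past a 6-char backslash-u escape, a 2-char escape, or one character.
            i += isEscape ? 6 : (json.charAt(i) == '\\' ? 2 : 1);
        }
        if (pendingHigh >= 0) bad.add(pendingHigh);
        return bad;
    }

    /** True if s has four ASCII hex digits starting at index start. */
    static boolean isHex4(String s, int start) {
        if (start + 4 > s.length()) return false;
        for (int k = start; k < start + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }
}
```

Skipping two characters after any other backslash keeps an escaped backslash followed by "uD83D" from being mistaken for an escape.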

> On May 7, 2015, at 2:33 PM, Mark Davis ☕️ <mark at macchiato.com> wrote:
> 
> The simplest approach would be to use ICU in a little program that scans the file. For example, you could write a little Java program that would scan the file and turn any sequence of (\uXXXX)+ into a String, then test that string with:
> 
> import com.ibm.icu.text.UnicodeSet;
> ...
> static final UnicodeSet OK = new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
> ...
> // inside the scanning function
> boolean isOk = OK.containsAll(slashUString);
> 
> It is key that it has to grab the entire sequence of \uXXXX in a row; otherwise it will get the wrong answer.
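
The scan described above might look roughly like the following sketch. It is not Mark's program: to stay free of dependencies it swaps ICU's UnicodeSet for java.lang.Character.getType, which only knows the Unicode version bundled with the running JDK. The names EscapeRunCheck, suspectRuns and isOk are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; uses the JDK's own character data as a stand-in for ICU.
class EscapeRunCheck {

    /** Source text of each maximal run of backslash-u escapes that fails isOk. */
    static List<String> suspectRuns(String json) {
        List<String> suspect = new ArrayList<>();
        StringBuilder run = new StringBuilder(); // decoded UTF-16 code units of the current run
        int runStart = -1;
        int i = 0;
        while (i <= json.length()) {             // one extra pass flushes a run at end of input
            boolean isEscape = i + 1 < json.length() && json.charAt(i) == '\\'
                    && json.charAt(i + 1) == 'u' && isHex4(json, i + 2);
            if (isEscape) {
                if (runStart < 0) runStart = i;
                run.append((char) Integer.parseInt(json.substring(i + 2, i + 6), 16));
                i += 6;
                continue;
            }
            if (runStart >= 0) {                 // the run ended just before position i
                if (!isOk(run.toString())) suspect.add(json.substring(runStart, i));
                run.setLength(0);
                runStart = -1;
            }
            // Skip a 2-char escape such as an escaped backslash, or a single character.
            i += (i < json.length() && json.charAt(i) == '\\') ? 2 : 1;
        }
        return suspect;
    }

    /** True if no code point in s is unassigned or an unpaired surrogate. */
    static boolean isOk(String s) {
        return s.codePoints().noneMatch(cp -> {
            int type = Character.getType(cp);
            return type == Character.UNASSIGNED || type == Character.SURROGATE;
        });
    }

    /** True if s has four ASCII hex digits starting at index start. */
    static boolean isHex4(String s, int start) {
        if (start + 4 > s.length()) return false;
        for (int k = start; k < start + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }
}
```

Decoding the whole run before testing is what lets a valid escaped pair come out as one supplementary code point; testing each escape on its own would see two surrogates and reject both.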
> 
> 
> Mark <https://google.com/+MarkDavis>
> 
> — Il meglio è l’inimico del bene —
> 
> On Thu, May 7, 2015 at 10:49 AM, Doug Ewell <doug at ewellic.org> wrote:
> "Costello, Roger L." <Costello at mitre dot org> wrote:
> 
> > Are there tools to scan a JSON document to detect the presence of
> > \uXXXX, where XXXX does not correspond to any Unicode character?
> 
> A tool like this would need to scan the Unicode Character Database, for
> some given version, to determine which code points have been allocated
> to a coded character in that version and which have not.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
> 
> 
