Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Mark Davis ☕️ mark at macchiato.com
Thu May 7 13:33:54 CDT 2015


The simplest approach would be to use ICU in a little program that scans
the file. For example, you could write a little Java program that would
scan the file, and turn any sequence of (\uXXXX)+ into a String, then
test that string with:

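import com.ibm.icu.text.UnicodeSet;
...
// everything except unassigned code points and surrogates
static final UnicodeSet OK =
    new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
...
// inside the scanning function; slashUString is the decoded run of escapes
boolean isOk = OK.containsAll(slashUString);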

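It is key that it has to grab the entire sequence of \uXXXX in a row;
otherwise it will get the wrong answer. A supplementary character is
written as two escapes (a surrogate pair), and each half tested on its
own would be rejected as a surrogate.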
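
Here is a rough sketch of what such a scanner might look like. To keep it
free of dependencies it swaps ICU's UnicodeSet for the JDK's own
java.lang.Character, so "assigned" means assigned in whatever Unicode
version that JDK ships. The names EscapeScanner, findBadRuns and isHex4
are made up for illustration. It also assumes the input is otherwise
well-formed JSON and does not track whether it is inside a string.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapeScanner {

    /** True if no code point in s is unassigned or a lone surrogate. */
    static boolean isOk(String s) {
        return s.codePoints().allMatch(cp ->
                Character.isDefined(cp)
                        && Character.getType(cp) != Character.SURROGATE);
    }

    /** True if the four chars starting at pos are ASCII hex digits. */
    static boolean isHex4(String s, int pos) {
        for (int k = pos; k < pos + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9')
                    || (c >= 'a' && c <= 'f')
                    || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }

    /** Tests the run collected so far, records it if bad, then resets it. */
    private static void flush(StringBuilder run, List<String> bad) {
        if (run.length() > 0) {
            String decoded = run.toString();
            if (!isOk(decoded)) bad.add(decoded);
            run.setLength(0);
        }
    }

    /** Returns the decoded text of every maximal run of escapes that fails isOk. */
    static List<String> findBadRuns(String json) {
        List<String> bad = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int i = 0;
        while (i < json.length()) {
            if (json.charAt(i) == '\\' && i + 1 < json.length()) {
                if (json.charAt(i + 1) == 'u' && i + 6 <= json.length()
                        && isHex4(json, i + 2)) {
                    // Decode one escape and keep extending the current run.
                    run.append((char) Integer.parseInt(json.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
                i += 2; // some other escape, e.g. an escaped backslash
            } else {
                i++;
            }
            flush(run, bad); // anything else ends the current run
        }
        flush(run, bad);
        return bad;
    }

    public static void main(String[] args) {
        String sample = "{\"ok\":\"\\uD83D\\uDE00\",\"bad\":\"\\uD83D\"}";
        System.out.println(findBadRuns(sample).size() + " bad run(s)");
    }
}
```

Consuming any other backslash escape as a two-character unit is what
keeps an escaped backslash followed by a literal "u" and four hex digits
from being mistaken for an escape.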


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Thu, May 7, 2015 at 10:49 AM, Doug Ewell <doug at ewellic.org> wrote:

> "Costello, Roger L." <Costello at mitre dot org> wrote:
>
> > Are there tools to scan a JSON document to detect the presence of
> > \uXXXX, where XXXX does not correspond to any Unicode character?
>
> A tool like this would need to scan the Unicode Character Database, for
> some given version, to determine which code points have been allocated
> to a coded character in that version and which have not.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>