Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Markus Scherer markus.icu at gmail.com
Thu May 7 14:59:54 CDT 2015


I assume that the JSON spec deliberately allows anything that Java and
JavaScript allow. In particular, there is no requirement for a Java String
or JavaScript string to contain "text", or well-formed UTF-16, or only
assigned characters. Some code stores binary data (a sequence of arbitrary
16-bit unsigned integers) in a "string", just because that is easy and fairly
efficient to transport.
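
As a rough illustration of the point above, here is a minimal Python sketch.
It assumes CPython's standard json module, which accepts an unpaired
surrogate escape rather than rejecting it. The helper name lone_surrogates
is made up for this example and is not part of any library.

```python
import json

def lone_surrogates(s):
    """Return the indexes of surrogate code points left in a decoded str."""
    # json.loads merges a valid \uD8xx\uDCxx pair into one code point,
    # so any surrogate still present after decoding was unpaired.
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]

paired = json.loads('"\\ud83d\\ude00"')  # well-formed pair, becomes U+1F600
lone = json.loads('"a\\ud800b"')         # unpaired high surrogate, still parses

print(lone_surrogates(paired))  # []
print(lone_surrogates(lone))    # [1]
```

The parser raises no error for the second string; the caller decides
whether an unpaired surrogate is a problem for its data.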

You should "validate" *text* only when you are certain that it is indeed
text. And when you do validate, you might want to be narrower than
"assigned character"; for example, you might require Unicode identifiers or
XML NMTOKENS or whatever. Also remember that "assigned" and "identifier"
and such depend on the version of Unicode your library currently implements.
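
A small sketch of what such validation could look like, again in Python and
only as an assumption about one possible approach. It uses the standard
unicodedata module; is_assigned is a hypothetical helper name. The result for
a given code point depends on unicodedata.unidata_version, so a code point
reported as unassigned today (U+0378 is used here as an example) may be
reported as assigned by a later library version.

```python
import unicodedata

def is_assigned(ch):
    """True unless the bundled Unicode data marks ch as unassigned."""
    # General category "Cn" covers unassigned code points and noncharacters.
    return unicodedata.category(ch) != "Cn"

# The Unicode version this Python build implements; answers depend on it.
print(unicodedata.unidata_version)

print(is_assigned("A"))       # an assigned letter
print(is_assigned("\u0378"))  # unassigned in the Unicode versions known to me

# A narrower check than "assigned": Python's own identifier syntax.
print("caf\u00e9".isidentifier())
print("2cafe".isidentifier())
```

The identifier check is much stricter than "assigned": it rejects spaces,
punctuation, and a leading digit, all of which are assigned characters.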

markus