Parsers for the UnicodeSet notation?

Eric Muller emuller at
Thu Jul 24 01:51:15 CDT 2014

Thanks for the answers.

I take it from Steve's answer that Roozbeh's parser may work today but 
may break tomorrow.

A couple of suggestions:

- a full "parser" of UnicodeSet is non-trivial, since it requires 
access to property values. That does not seem really necessary for 
exemplars, so maybe it would be good to restrict the UnicodeSet syntax 
there.

- alternatively, since the extent of a UnicodeSet can involve property 
values, the extent can depend on the Unicode version those values come 
from. Which means that there ought to be a Unicode version number in 
the CLDR data; it would be nice for that number to be present in the 
data files (I don't see one in he.xml).
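
To make the first suggestion concrete, here is a rough sketch of what a 
parser for a restricted notation could look like. It is not ICU's 
implementation and not a proposal for the exact subset: the function 
names (parse_simple_set, _read_escape) are made up, and the choice of 
what to accept (single characters, a-b ranges, {strings}, \uXXXX, 
\UXXXXXXXX and backslash-escaped literals) is my assumption. Anything 
that needs property data or set operations is rejected.

```python
def _read_escape(body, pos):
    """Decode the escape whose first character (after the backslash) is at pos."""
    if pos >= len(body):
        raise ValueError("dangling backslash")
    c = body[pos]
    if c in "pPNx":
        # \p, \P and \N need Unicode data; \x is left out to keep the sketch small.
        raise ValueError("escape not supported in this sketch: \\" + c)
    width = {"u": 4, "U": 8}.get(c)
    if width is None:
        return c, pos + 1  # escaped literal such as \- or \:
    digits = body[pos + 1:pos + 1 + width]
    if len(digits) != width:
        raise ValueError("truncated \\u escape")
    return chr(int(digits, 16)), pos + 1 + width


def parse_simple_set(pattern):
    """Expand a restricted UnicodeSet pattern into a Python set of strings."""
    if len(pattern) < 2 or pattern[0] != "[" or pattern[-1] != "]":
        raise ValueError("pattern must be enclosed in [ ]")
    body = pattern[1:-1]
    result = set()
    last = None       # most recent single character, a possible range start
    in_range = False  # True right after an unescaped '-'
    i = 0
    while i < len(body):
        c = body[i]
        if c.isspace():
            i += 1
            continue
        if c in "[]^&$":
            # Nested sets, [:prop:], negation, intersection, variables.
            raise ValueError(f"unsupported syntax at offset {i}: {c!r}")
        if c == "{":
            end = body.find("}", i)
            if end == -1 or in_range:
                raise ValueError("malformed {string}")
            # Simplification: no escapes inside braces, whitespace dropped.
            result.add("".join(body[i + 1:end].split()))
            last, i = None, end + 1
            continue
        if c == "-":
            if last is None or in_range:
                raise ValueError("a literal '-' must be escaped as \\-")
            in_range, i = True, i + 1
            continue
        if c == "\\":
            ch, i = _read_escape(body, i + 1)
        else:
            ch, i = c, i + 1
        if in_range:
            if ord(last) > ord(ch):
                raise ValueError("range end precedes range start")
            result.update(chr(cp) for cp in range(ord(last), ord(ch) + 1))
            last, in_range = None, False
        else:
            result.add(ch)
            last = ch
    if in_range:
        raise ValueError("dangling '-'")
    return result
```

With this, parse_simple_set(r"[a-c {ch} \u05D0 \-]") gives the three 
letters a, b, c, the string "ch", U+05D0 and a literal hyphen, while 
parse_simple_set(r"[\p{L}]") raises ValueError. Since no property 
lookup is involved, the result does not depend on the Unicode version.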

> Incidentally, I copy/pasted the punctuation exemplar characters for 
> he.xml into the utility, and it reported that the set contains 8,130 
> code points, including the ascii letters. Somehow, that seems 
> incorrect. What did I do wrong?

Sorry, I took the UnicodeSet straight out of he/characters.json, without 
handling the JSON serialization (or rather deserialization) of strings.

Taking it straight out of he.xml (where there is no serialization 
effect) gives a much more reasonable set of twenty strings.  XML wins 
again ;-)
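
For anyone who hits the same thing, here is a small sketch of the 
effect. The JSON fragment and its "punctuation" key are made up for 
illustration and are not the actual content of he/characters.json. In 
JSON, each backslash of the pattern is written as two, so pasting the 
text as it sits in the file hands the UnicodeSet parser an escaped 
backslash followed by a bare '-', which then reads as a range.

```python
import json

# Hypothetical JSON text as it would sit in a file (not real CLDR data).
raw_json = r'{"punctuation": "[\\-\u2010 , ; \\: ! ?]"}'

# Proper route: let a JSON parser undo the string escaping.
decoded = json.loads(raw_json)["punctuation"]

# What I did: lift the characters between the quotes as-is.
undecoded = raw_json.split('"')[3]

print(decoded)    # escaped hyphen, U+2010, and a few more marks
print(undecoded)  # still has doubled backslashes and a textual \u2010

# In the undecoded text, "\\" is a literal backslash (U+005C) and the
# following "-" makes it the start of a range ending at U+2010.
range_size = 0x2010 - 0x5C + 1
print(range_size)
```

That range alone covers 8117 code points, lowercase ASCII letters 
included, which is the kind of inflated set the utility reported.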

