annotations (was: NamesList.txt as data source)

Sun Mar 13 21:14:05 CDT 2016

On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote:

> My point is that of J.S. Choi and Janusz Bień: the problem with
> declaring NamesList off-limits is that it does contain information that
> is either:
> 
> • not available in any other UCD file, or
> • available, but only in comments (like the MAS mappings), which aren't
> supposed to be parsed either.
> 
> Ken wrote:
> 
> > [ .. ] NamesList.txt is itself the result of a complicated merge
> > of code point, name, and decomposition mapping information from
> > UnicodeData.txt, of listings of standardized variation sequences from
> > StandardizedVariants.txt, and then a very long list of annotational
> > material, including names list subhead material, etc., maintained in
> > other sources.
> 
> But sometimes an implementer really does need a piece of information
> that exists only in those "other sources." When that happens, sometimes
> the only choices are to resort to NamesList or to create one's own data
> file, as Ken did by parsing the comment lines from the math file. Both
> of these are equally distasteful when trying to be conformant.

If so, then extending the XML UCD with all the information that is missing from it but available in the Code Charts and NamesList.txt would be a good idea. Still, such a step would greatly increase the amount of data, because items that were never meant to be provided systematically would then have to be.

Further, I see that once this is completed, other requirements might call for doing the same job on the core specs.

The question is whether such needs are frequent in Unicode implementation and i18n. For example, the latest Apostrophe thread showed that full automation is sometimes impossible anyway.

Marcel