NamesList.txt as data source

Thu Mar 10 20:13:21 CST 2016

On 3/10/2016 5:49 PM, "J. S. Choi" wrote:
> One thing about NamesList.txt is that, as far as I have been able to tell, it’s the only machine-readable, parseable source of those annotations and cross-references.

There are explanations about character use that are maintained only in 
the PDF of the core specification, where the information is packaged in 
a way that a human reader can understand, but that is not amenable to 
machine extraction.

While the annotations, comments, cross-references, etc. in NamesList.txt 
appear, formally, to be machine-extractable, the way they are created 
and managed makes them just as much "human-accessible only" as the core 
specification.

The goal of getting a complete and machine-readable description of 
character behavior is illusory.
>
> As part of the Unicode Standard and the UCD, the name lists’ annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification’s chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people—i.e., using screen-reader-friendly HTML rather than PDF—while making clear that the annotations are merely references to the original, normative Standard’s actual code charts and name lists.

This is a different issue. NamesList.txt is a reasonable source for 
driving _formatting_ programs other than Unibook. In fact, the 
possibility of reuse in this context is probably among the unstated 
rationales for making the information and syntax available in the first 
place.

Let's understand this properly: translating the file into a 
"human-readable" output format is a proper use of this data, even if 
that translation is done using a mechanical tool, as long as the format is
a) a format that benefits from the special shortcuts taken in selecting 
the information present in the NamesList.txt file,
b) a format intended to be interpreted by an observant and intelligent 
human reader, and not
c) a format intended as direct input to any text-processing algorithm, 
or any algorithm that "understands" the contents.
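
To make the distinction concrete, here is a rough sketch of the kind of 
reformatting I would consider a proper use: a mechanical translation of 
name-list lines into an HTML definition list for a human reader. It is 
only an illustration under stated assumptions, not how Unibook works. 
The function name, the label table, and the handling of only three 
annotation markers ("=", "*", "x") are my own simplifications, and the 
sample entry in the usage note is made up rather than copied from the 
file. The annotation text is passed through verbatim; the code does not 
try to "understand" it.

```python
import html

# Assumed, simplified subset of annotation markers; the labels are
# illustrative and not taken from any official tool.
ANNOTATION_LABELS = {
    "=": "alias",
    "*": "note",
    "x": "see also",
}


def names_list_to_html(text):
    """Render a simplified subset of name-list lines as an HTML <dl>."""
    out = ["<dl>"]
    for line in text.splitlines():
        if not line or line.startswith(("@", ";")):
            # This sketch skips headers, notices, and comment lines.
            continue
        if line.startswith("\t"):
            # Annotation line: a marker, a space, then free text.
            body = line.strip()
            marker, _, rest = body.partition(" ")
            label = ANNOTATION_LABELS.get(marker)
            if label is None:
                # Unrecognized marker: show the whole line unchanged.
                label, rest = "other", body
            out.append(f"<dd>{label}: {html.escape(rest)}</dd>")
        else:
            # Character line: hexadecimal code point, a tab, then the name.
            code, _, name = line.partition("\t")
            out.append(f"<dt>U+{html.escape(code)} {html.escape(name)}</dt>")
    out.append("</dl>")
    return "\n".join(out)
```

Calling names_list_to_html("E000\tEXAMPLE NAME\n\t= example alias\n") on 
that made-up entry yields one <dt> line for the character and one <dd> 
line for the alias. What the alias means is still left to the reader, 
which is the point of (b) and (c).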
>
> What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources’ data have been published, then, for better or for worse, the name list is all that is available for much information on many code points’ usage.
See my first through third paragraphs.

A./
