annotations (was: NamesList.txt as data source)

Sun Mar 13 00:55:24 CST 2016

On Thu, Mar 10 2016 at 22:40 CET, kenwhistler at att.net writes:

> The *reason* that NamesList.txt exists at all is to drive the tool,
> unibook, that formats the full Unicode code charts for posting. 

[...]

On Fri, Mar 11 2016 at  3:13 CET, asmusf at ix.netcom.com writes:
> On 3/10/2016 5:49 PM, "J. S. Choi" wrote:

>> One thing about NamesList.txt is that, as far as I have been able to
>> tell, it’s the only machine-readable, parseable source of those
>> annotations and cross-references.

[...]

> This is a different issue. The nameslist.txt is a reasonable source
> for driving other formatting programs than just Unibook.

Exactly.

A student of mine wrote a font sampling program that produces output in
a Unibook-like form. For this purpose he also wrote a converter from
the NamesList format to XML:

          https://github.com/ppablo28/fntsample_ucd_comments

          https://github.com/ppablo28/ucd_xml_parser
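
To illustrate the idea, here is a minimal Python sketch of such a
conversion. It is my own illustration and not the code from the
repositories above: the function name and the XML element names are
hypothetical, and it only handles the name lines and the one-character
annotation markers described in the NamesList format documentation.

```python
import xml.etree.ElementTree as ET

# Annotation markers from the NamesList format; the XML element names
# on the right are hypothetical, chosen only for this sketch.
KINDS = {"=": "alias", "*": "comment", "x": "xref", ":": "decomposition",
         "#": "compat", "%": "formal-alias", "~": "variation"}

def nameslist_to_xml(text):
    """Convert NamesList-style text into a small XML tree (sketch only)."""
    root = ET.Element("nameslist")
    char = None
    for line in text.splitlines():
        if not line or line.startswith(";"):
            continue                      # blank line or file comment
        if line.startswith("@"):
            char = None                   # block header, subheader, notice
            continue
        if not line.startswith("\t"):
            # Name line: code point, TAB, character name
            cp, _, name = line.partition("\t")
            char = ET.SubElement(root, "char", cp=cp, name=name)
        elif char is not None:
            # Annotation line: TAB, marker, space, text
            body = line.lstrip("\t")
            kind = KINDS.get(body[:1])
            if kind and body[1:2] == " ":
                ET.SubElement(char, kind).text = body[2:].strip()
    return root

# A short fragment modelled on the U+018D entry.
SAMPLE = ("@@\t0180\tLatin Extended-B\t024F\n"
          "018D\tLATIN SMALL LETTER TURNED DELTA\n"
          "\t= reversed Polish-hook o\n")

if __name__ == "__main__":
    print(ET.tostring(nameslist_to_xml(SAMPLE), encoding="unicode"))
```

A real converter would also have to deal with notices, cross-reference
syntax and the other line types of the format, which this sketch simply
skips.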

I use the XML version of NamesList to add my own comments on
characters (work in progress):

         https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf

Other examples of NamesList.txt use are

          http://www.fileformat.info/info/unicode/
          https://codepoints.net/

Although these are not exactly formatting programs, in my opinion they
also constitute a valid use.

> In fact, the possibility of reuse in this context probably among the
> unstated rationales for making the information and syntax available in
> the first place.

Do I understand correctly that there is no intention to produce an
official XML version of the file, as it would require changes to
Unibook?

[...]

>> What are these other primary sources that maintain these other
>> annotation data; are they publicly available? If the name list is the
>> only place where these sources’ data have been published, then, for
>> better or for worse, the name list is all that is available for much
>> information on many code points’ usage.

> See my first through third paragraph.

You wrote:

[...]

> There are explanations about character use that are only maintained in
> the PDF of the core specification, where this information is packaged
> in a way that can be understood by a human reader, but is not amenable
> to be extracted by machine.
>
> While the annotations, comments, cross references etc. in Namelist.txt
> appear, formally, to be machine extractable, the way they are created
> and managed make them just as much "human-accessible" only as the core
> specification.

I'm afraid this is not clear to me. Let's take an example. Some time
ago I inquired about a controversial alias for U+018D:

        http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html

Can I really find anything about "reversed Polish-hook o" in the core
specification which is not a literal copy of the information from
NamesList.txt?

Best regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/