annotations (was: NamesList.txt as data source)
Janusz S. Bień
jsbien at mimuw.edu.pl
Sun Mar 13 00:55:24 CST 2016
On Thu, Mar 10 2016 at 22:40 CET, kenwhistler at att.net writes:
> The *reason* that NamesList.txt exists at all is to drive the tool,
> unibook, that formats the full Unicode code charts for posting.
[...]
On Fri, Mar 11 2016 at 3:13 CET, asmusf at ix.netcom.com writes:
> On 3/10/2016 5:49 PM, "J. S. Choi" wrote:
>> One thing about NamesList.txt is that, as far as I have been able to
>> tell, it’s the only machine-readable, parseable source of those
>> annotations and cross-references.
[...]
> This is a different issue. The nameslist.txt is a reasonable source
> for driving other formatting programs than just Unibook.
Exactly.
A student of mine wrote a font sampling program that produces output in
a Unibook-like form. For this purpose he also wrote a converter from
the NamesList format to XML:
https://github.com/ppablo28/fntsample_ucd_comments
https://github.com/ppablo28/ucd_xml_parser
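To give a rough idea of what such a conversion involves, here is a minimal
Python sketch. It is not the converter from the repositories above, only an
illustration under simplifying assumptions: it handles character entries
plus the "=" (alias), "*" (comment) and "x" (cross reference) annotation
lines, and skips headers and all other line types. The function name
names_list_to_xml and the XML element and attribute names (nameslist, char,
cp, alias, comment, xref) are invented for this example. The sample entry
uses only the U+018D alias discussed in this thread.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the marker that starts an annotation line to an
# XML element name. The element names are invented for this sketch.
MARKERS = {
    "=": "alias",
    "*": "comment",
    "x": "xref",
}


def names_list_to_xml(text):
    """Convert a fragment in NamesList format to a small XML tree."""
    root = ET.Element("nameslist")
    current = None  # the <char> element that annotation lines attach to
    for line in text.splitlines():
        if not line or line.startswith(";"):
            continue  # blank line or file comment
        if line.startswith("@"):
            current = None  # block header, subheader or notice: skipped here
            continue
        if not line.startswith("\t"):
            # Character entry: code point, TAB, character name.
            code, _, name = line.partition("\t")
            current = ET.SubElement(root, "char", cp=code, name=name)
        elif current is not None:
            # Annotation line: TAB, marker, space, text.
            marker, _, value = line.strip().partition(" ")
            tag = MARKERS.get(marker)
            if tag is not None:
                ET.SubElement(current, tag).text = value
    return root


SAMPLE = (
    "018D\tLATIN SMALL LETTER TURNED DELTA\n"
    "\t= reversed Polish-hook o\n"
)

if __name__ == "__main__":
    print(ET.tostring(names_list_to_xml(SAMPLE), encoding="unicode"))
```

A real converter would also have to deal with block headers, notices,
decompositions and formal aliases, and with cross references whose target is
given as a name and code point in parentheses.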
I use the XML version of NamesList to add my own comments on
characters (work in progress):
https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf
Other examples of NamesList.txt use are
http://www.fileformat.info/info/unicode/
https://codepoints.net/
Although these are not exactly formatting programs, in my opinion they
also constitute a valid use.
> In fact, the possibility of reuse in this context probably among the
> unstated rationales for making the information and syntax available in
> the first place.
I understand there is no intention to make an official XML version of
the file, as it would require changes in Unibook?
[...]
>> What are these other primary sources that maintain these other
>> annotation data; are they publicly available? If the name list is the
>> only place where these sources’ data have been published, then, for
>> better or for worse, the name list is all that is available for much
>> information on many code points’ usage.
> See my first through third paragraph.
You wrote:
[...]
> There are explanations about character use that are only maintained in
> the PDF of the core specification, where this information is packaged
> in a way that can be understood by a human reader, but is not amenable
> to be extracted by machine.
>
> While the annotations, comments, cross references etc. in Namelist.txt
> appear, formally, to be machine extractable, the way they are created
> and managed make them just as much "human-accessible" only as the core
> specification.
I'm afraid this is not clear to me. Let's take an example. Some time ago I
inquired about a controversial alias for U+018D:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html
Can I really find anything about "reversed Polish-hook o" in the core
specification which is not a literal copy of the information from
NamesList.txt?
Best regards
Janusz
--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/