NamesList.txt as data source

Doug Ewell doug at ewellic.org
Sun Mar 27 13:38:53 CDT 2016


Asmus Freytag wrote:

> Nobody disputes that subheaders are informative. However, subheaders
> do not define a character property.

Janusz's point was that the CLDR data sometimes treats them as such, or 
at least as a kind of supplementary property.

> There are several good reasons:
>
> 1. They do not "classify" characters in a uniform way: For some ranges
> they give the purpose for which the character was encoded (as in your
> example), for others, they give the type of character (vowel,
> consonant), and in some cases they are free of information
> ("Miscellaneous addition").
>
> 2. Even where they give the purpose for which the character was
> encoded, they do not necessarily attest that the characters in that
> range are never used for other purposes.
>
> 3. The information is purely editorial, and as such, changed by the
> editors as needed, not assigned as result of a vote in the Unicode
> Technical Committee.
>
> 4. They appear to be more "formal" than they are, just because they
> are presented with semantic markup in the input file to the code chart
> layout tool; with the file being a rather structured file, only
> because it describes a tabular presentation of data. However, see
> points (1) through (3) on why this superficial appearance of formality
> is misleading.

It seems that the main concern about using NamesList.txt to obtain 
information beyond what is available in other UCD sources is that people 
might treat that additional information as normative and immutable, when 
it is not.

It is understood that UTC members draw important distinctions between 
normative and informative material, and between material that is 
immutable and that which may change over time. For many purposes, these 
distinctions are crucial. However, there are uses for Unicode character 
data that do not depend on these distinctions. Often it is simply not a 
problem if, say, CAT FACE WITH WRY SMILE acquires a new informative 
cross-reference in one Unicode release, and that cross-reference 
suddenly changes or disappears in the next release.
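
As a rough sketch of that kind of use, the Python below collects the 
subheader and cross-references for each character and treats them as 
nothing more than informative annotations. It rests on several 
assumptions that are not taken from this thread: that subheader lines 
start with "@" followed by a tab, that block headers start with "@@", 
that cross-reference lines start with a tab followed by "x ", and that 
the sample text (invented here in the style of NamesList.txt, not 
copied from the real file) is representative. The function name 
parse_informative is hypothetical.

```python
import re

# Assumption: a character entry is a hex code point, a tab, then the name.
CHAR_LINE = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")

def parse_informative(lines):
    """Map code point -> name, subheader, and cross-references.

    Everything except the name is informative and may change or
    disappear between Unicode releases.
    """
    subheader = None   # most recent subheader seen in the current block
    current = None     # code point of the most recent character entry
    info = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("@@"):
            # Block header (or title line): a new block resets the subheader.
            subheader = None
            current = None
        elif line.startswith("@\t"):
            # Subheader line (assumed form: "@", one or more tabs, text).
            subheader = line[1:].strip()
            current = None
        elif (m := CHAR_LINE.match(line)):
            current = int(m.group(1), 16)
            info[current] = {
                "name": m.group(2),
                "subheader": subheader,
                "xrefs": [],
            }
        elif line.startswith("\tx ") and current is not None:
            # Cross-reference: kept as raw text, not interpreted further.
            info[current]["xrefs"].append(line[3:].strip())
    return info

# Invented sample in the style of NamesList.txt (not real file content).
SAMPLE = (
    "@@\t0030\tSample Block\t0039\n"
    "@\t\tASCII digits\n"
    "0030\tDIGIT ZERO\n"
    "\tx (empty set - 2205)\n"
    "0031\tDIGIT ONE\n"
)

if __name__ == "__main__":
    for cp, entry in parse_informative(SAMPLE.splitlines()).items():
        print(f"U+{cp:04X}", entry)
```

A consumer written this way loses nothing if a subheader or 
cross-reference is reworded or removed in a later release; it simply 
reports whatever the file currently says.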

My suggestion to assuage these fears is for UTC to add warnings to the 
file header (right below "This file is semi-automatically derived...") 
or to NamesList.html, or both, stating that any information in 
NamesList.txt beyond that which can be found in other UCD files is 
informative and subject to change without notice. Then the burden, if 
such it is, will be on users to heed these warnings.

--
Doug Ewell | http://ewellic.org | Thornton, CO