Unicode Digest, Vol 56, Issue 20

Marcel Schneider via Unicode unicode at unicode.org
Thu Aug 30 18:14:41 CDT 2018


Thank you for looking into this. First, I’m unable to retrieve the publication you are citing, 
but a February thread had nearly the same subject, referring to Vol. 50. How did you 
compute these figures? Is that a code phrase to say: “The same questions over and 
over again; let’s settle this on the record, as a reference for later inquiries”?

Also, "unicode-request at unicode.org" doesn’t appear to be a valid e-mail address.
That would mean that I’d better send a proposal with an enhancement request to
docsubmit at unicode.org, rather than contribute to the topic while it is being discussed 
on the Unicode Public Mail List?

OK, I’ll try to get something out of this, because many people really want things to 
improve:

On 30/08/18 20:37 Doug Ewell via Unicode wrote:
> 
> UnicodeData.txt was devised long before any of the other UCD data files.

I can’t think of any era in the computer age when file headers were uncommon, 
or when a parser able to process semicolons couldn’t be directed to make sense
of crosshatches (#). If releasing a headerless file was ever a mistake, implementers 
would be able to anticipate that it might be corrected at some point. Implementations 
have to be updated at every single Unicode release anyway; that much I can tell, even 
without knowing the arcana of frozen APIs.

> Though it might seem like a simple enhancement to us, adding a header block, or even a single line,
> would break a lot of existing processes that were built long ago to parse this file.

They are hopelessly outdated anyway, and most of them will have been replaced with something 
better long ago. The remainder might not be worth bothering the rest of the world with 
headerless files.

> So Unicode can't add a header to this file, and that is the reason the format can never be changed
> (e.g. with more columns). That is why new files keep getting created instead.

I had figured that the rationale was something like that, and I can also understand that 
Unicode isn’t going to keep releasing headerless files until some guy tells them not to, and 
then suddenly add the missing header. Also, I didn’t really ask for that, but suggested adding 
yet another *new* file, not changing the data structure of the existing UnicodeData.txt.

As for the reference, a Google search for "unicodedataextended.txt" just brought it up:
http://www.unicode.org/review/pri297/

Having said that, I still think that while not parsing a header line in a process is a 
reasonable position if the field structure is known to be stable, not being able to *skip* 
a header is sort of odd.
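
To make the point concrete, here is a minimal sketch of a reader that skips a header, 
assuming comment lines start with "#" (as in the other UCD data files) and that "#" never 
occurs inside a field. The header block in the sample is hypothetical; the real 
UnicodeData.txt has none. The function name parse_ucd_lines is made up for illustration.

```python
import io

def parse_ucd_lines(lines, comment_char="#"):
    """Yield one list of fields per data line, skipping comments and blank lines."""
    for raw in lines:
        # Drop everything from the comment character onward, then trim whitespace.
        # Assumption: the comment character never appears inside a field.
        line = raw.split(comment_char, 1)[0].strip()
        if not line:
            continue  # header, comment-only, or empty line
        yield [field.strip() for field in line.split(";")]

# Hypothetical header block followed by two lines in UnicodeData.txt layout.
sample = io.StringIO(
    "# UnicodeData (hypothetical header, for illustration only)\n"
    "# code;name;gc;...\n"
    "\n"
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041\n"
)

rows = list(parse_ucd_lines(sample))
for row in rows:
    print(row[0], row[1], row[2])
```

The only change against a headerless parser is the two lines that strip comments and skip 
empty results; the field handling is untouched.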

> The XML format could indeed be expanded with more attributes and more subsections.
> Any process that can parse XML can handle unknown stuff like this without misinterpreting
> the stuff it does know.

Agreed. I’m not questioning XML. But I’m using spreadsheets. I don’t know how many computer
scientists use spreadsheets. Perhaps there are not many of us looking up UnicodeData.txt that way
(I use it in raw text, too, and I look up ucd.nounihan.flat.xml). Generating code in a 
spreadsheet is considered quick-and-dirty. I don’t agree it’s dirty, but it’s quick.

And above all, it appears that doing certain research in spreadsheets is the most efficient 
way to check whether character properties match character identity. Using spreadsheet 
software is trivial, so it might be looked down on and left to non-scientists, yet it is 
closer to human experience and allows one to do research in nearly no time by adding columns, 
filters and formulae, research that one would probably spend weeks coding in C, Lisp, Perl or 
Python (which I cannot do, so I’m biased).

> That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files,
> and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter,
> given these two alternatives.

Given the above, one can easily understand why I do not agree with being limited to these two
alternatives. 

Given that a process must be updatable in order to grab a newly added small file 
from the UCD, it can just as well be updated to skip file comments, and even to 
parse a new *large* file from the UCD.

On the other hand, given that Unicode are ready to add new small semicolon-delimited files, 
they might just as well add a new *large* semicolon-delimited file to the UCD.
That large file would have a file header and a header line, and be specified as being flexible.
That file might have one hundred fields per line, delimited by 99 semicolons. These 5 million 
semicolons would still be more lightweight than 5 million attribute names plus the XML tags.
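
As a rough illustration of that weight difference, the sketch below counts the bytes that 
are not data in one record written both ways. The four attribute names follow the short 
names used in the flat XML file, but the selection is only an example, and the figures say 
nothing about the size of the real files.

```python
# One record with four illustrative properties.
fields = {"cp": "0041", "na": "LATIN CAPITAL LETTER A", "gc": "Lu", "sc": "Latn"}

# Semicolon-delimited form: values only, names live once in the header line.
csv_line = ";".join(fields.values())

# XML form: every value carries its attribute name, quotes, and the element tag.
xml_line = "<char " + " ".join(f'{k}="{v}"' for k, v in fields.items()) + "/>"

payload = sum(len(v) for v in fields.values())
csv_overhead = len(csv_line) - payload
xml_overhead = len(xml_line) - payload

print("delimited overhead per record:", csv_overhead)
print("XML overhead per record:", xml_overhead)
```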

The added value is that people using spreadsheets have a handy file to import, rather than 
each individual having to convert a large XML file to a large CSV file, for lack of the latter
being readily provided by Unicode.
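
For reference, the conversion that each individual currently has to do could be sketched 
roughly as follows. This assumes the flat XML layout with one char element per code point 
in the UCD namespace; the inline sample, the function name xml_to_semicolon_csv and the 
choice of columns are illustrative only, and code point ranges (first-cp/last-cp) and 
other element types are ignored.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the UCD in XML files, in ElementTree's {uri}tag notation.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Tiny stand-in for a flat UCD XML file (illustrative subset of attributes).
SAMPLE_XML = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll" sc="Latn"/>
  </repertoire>
</ucd>"""

def xml_to_semicolon_csv(xml_text, columns):
    """Return semicolon-delimited text with a header line, one row per char element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(columns)  # header line naming the fields
    for char in root.iter(UCD_NS + "char"):
        # Missing attributes become empty fields.
        writer.writerow([char.get(col, "") for col in columns])
    return out.getvalue()

csv_text = xml_to_semicolon_csv(SAMPLE_XML, ["cp", "na", "gc", "sc"])
print(csv_text)
```

The resulting text imports directly into a spreadsheet with ";" set as the field separator.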

If this discussion has a positive echo, I or somebody else may submit an appropriate proposal.
But I’d prefer not repeating the mistake of not discussing a topic on Unicode Public prior to 
submitting a proposal that is then kindly put on the agenda, but discussed in disfavor and 
dismissed in disgrace twice at UTC meetings. And guess why I didn’t want prior discussion
here? Because I was naively afraid that the unveiled mistakes could reflect badly on some people.

Turned out that nothing reflects badly on anybody. 

(So UnicodeData.txt could as well get its missing header BTW.)

Regards,

Marcel


