Unicode Digest, Vol 56, Issue 20

Doug Ewell via Unicode unicode at unicode.org
Thu Aug 30 13:27:30 CDT 2018


UnicodeData.txt was devised long before any of the other UCD data files. Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file.
So Unicode can't add a header to this file, and for the same reason the format can never be changed (e.g. by adding more columns). That is why new files keep getting created instead.
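
To make the breakage concrete, here is a minimal sketch of the kind of strict parser that might have been written long ago. The function name `parse_unicode_data` and the header line shown in the usage note are hypothetical; the two sample records follow the 15-field semicolon-delimited layout of UnicodeData.txt.

```python
def parse_unicode_data(lines):
    """Strict parser: every non-empty line must be a 15-field record."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(";")
        if len(fields) != 15:
            # No notion of comments or headers: anything else is an error.
            raise ValueError("unexpected line: %r" % line)
        code_point = int(fields[0], 16)
        # Keep only the name (field 1) and General_Category (field 2).
        records[code_point] = (fields[1], fields[2])
    return records


SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041\n",
]

records = parse_unicode_data(SAMPLE)
```

Feeding the same parser a hypothetical header such as `# UnicodeData-11.0.0.txt` as the first line raises `ValueError`, which is the failure mode described above.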
The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know.
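
As a rough illustration of that tolerance, the sketch below reads a small hand-written fragment in the style of the UCD XML format using Python's standard `xml.etree.ElementTree`. The `future_prop` attribute and the `future-section` element are invented for this example and are not part of the real format; the namespace URI and the `cp`, `na`, and `gc` attributes are assumed to match the published XML files.

```python
import xml.etree.ElementTree as ET

# Namespace assumed to match the UCD XML files.
NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hand-written fragment; future_prop and future-section are hypothetical
# additions that this reader has never heard of.
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" future_prop="x"/>
    <char cp="0661" na="ARABIC-INDIC DIGIT ONE" gc="Nd">
      <future-section note="hypothetical"/>
    </char>
  </repertoire>
</ucd>"""


def read_chars(xml_text):
    """Map code point -> (name, general category), ignoring everything else."""
    root = ET.fromstring(xml_text)
    result = {}
    for char in root.iter(NS + "char"):
        cp = char.get("cp")
        if cp is None:
            # Skip entries without a single code point (e.g. ranges).
            continue
        result[int(cp, 16)] = (char.get("na"), char.get("gc"))
    return result


chars = read_chars(SAMPLE)
```

The reader asks only for the attributes it knows by name, so the extra attribute and the extra child element pass through without affecting the result.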
That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives.


--Doug Ewell | Thornton, CO, US | ewellic.org
-------- Original message --------
Message: 3
Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
From: Marcel Schneider via Unicode <unicode at unicode.org>

Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It's like the file was made up for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that's why at some point I submitted feedback asking 
for an extension. [...]
