UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

Marcel Schneider via Unicode unicode at unicode.org
Thu Aug 30 23:58:37 CDT 2018

On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> Well, an alternative to XML is JSON, which is more compact and faster/simpler to process;

Thanks for pointing out both the problem and the solution. Indeed, the main drawback of the XML 
format of the UCD is that it results in an “insane” file size. “Insane” was once applied to the number
of semicolons in UnicodeData.txt, but that is irrelevant. What is really insane is the file size
of the XML versions of the UCD. Even without Unihan, the file may take up to a minute or so to load 
in a text editor.
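For what it’s worth, the load time is largely an artifact of reading the whole file at once; a streaming parser keeps memory flat. A minimal sketch in Python (the file name is the published one; the namespace URI is the one declared in the UCD XML files):

```python
# Sketch: stream ucd.nounihan.flat.xml instead of loading it whole.
import xml.etree.ElementTree as ET

NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def iter_chars(path):
    """Yield one attribute dict per <char> element, freeing memory as we go."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "char":
            yield dict(elem.attrib)
            elem.clear()  # release the parsed subtree

# Usage (path as downloaded from unicode.org):
# for attrs in iter_chars("ucd.nounihan.flat.xml"):
#     print(attrs.get("cp"), attrs.get("na"))
```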

> however JSON has no explicit schema, unless the schema is being made part of the data itself,
> complicating its structure (with many levels of arrays of arrays, in which case it becomes
> less easy to read by humans, but more adapted to automated processes for fast processing).
> I'd say that the XML alone is enough to generate any JSON-derived dataset that will conform
> to the schema an application expects to process fast
> (and with just the data it can process, excluding various extensions still not implemented).
> But the fastest implementations are also based on data tables encoded in code
> (such as DLLs or Java classes), or custom database formats (such as Berkeley DB)
> generated also automatically from the XML, without the processing cost of decompression schemes
> and parsers.
> Still today, even if XML is not the usual format used by applications, it is still
> the most interoperable format that allows building all sorts of applications
> in all sorts of languages: the cost of parsing is left to an application builder/compiler.
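The generation step Philippe describes can be sketched as follows; the attribute selection ("cp", "na", "gc" are real UCD XML attribute names) is an arbitrary example, and the namespace is handled by matching the tag suffix only:

```python
# Sketch: derive a compact, application-specific JSON dataset from the
# UCD XML, keeping only the attributes the application needs.
import json
import xml.etree.ElementTree as ET

WANTED = ("cp", "na", "gc")  # illustrative subset, not a recommendation

def xml_to_json(xml_text, wanted=WANTED):
    """Extract the wanted attributes of each <char> and emit compact JSON."""
    root = ET.fromstring(xml_text)
    chars = []
    for elem in root.iter():
        if elem.tag.endswith("char"):  # tolerate any namespace prefix
            chars.append({k: elem.attrib[k] for k in wanted if k in elem.attrib})
    return json.dumps(chars, separators=(",", ":"))
```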

I’ve tried an online tool to convert ucd.nounihan.flat.xml to CSV. The tool is great 
and offers a lot of options, but given the “insane” file size, my browser was stuck for over 
two hours until I shut the computer down manually. From what I could see in 
the result field, there are many bogus values, meaning that their presence in 
the tags of most characters is useless. And while many attributes have cryptic names in order to keep 
the file size minimal, some attributes have overlong values, i.e. the design is inconsistent.
E.g. in every character we read:
jg="No_Joining_Group"
That is bogus. One would need to take it off the tags of most characters, and even 
in the characters where it is relevant, the value would simply be "No". What’s the use 
of abbreviating "Joining Group" to "jg" in the attribute name if the value writes it out in full?
And I’m quoting from U+0000. 
Further, many values are set to a crosshatch instead of simply being removed from the 
characters where they are empty. Then there are the many instances of "undetermined script", 
resulting in *two* attributes with the value "Zyyy". Then in almost every character we’re told that 
it is not a whitespace, not a dash, not a hyphen, and not a quotation mark:
Dash="N" WSpace="N" Hyphen="N" QMark="N"
One couldn’t tell that the UCD actually benefits from the flexibility of XML, given that many 
attributes are systematically present even where they are useless.
Perhaps the ucd-*.xml files would be two thirds, half, or one third their actual size if they were 
properly designed.
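A converter that drops default-valued attributes before writing CSV or JSON would address exactly this redundancy. A sketch, with an illustrative (not exhaustive) table of defaults:

```python
# Sketch: strip attributes whose values match their defaults before
# serializing. The DEFAULTS table below is a small illustrative subset,
# not the full list of UCD default values.
DEFAULTS = {
    "Dash": "N", "WSpace": "N", "Hyphen": "N", "QMark": "N",
    "jg": "No_Joining_Group",
}

def strip_defaults(attrs, defaults=DEFAULTS):
    """Return a copy of the attribute dict without default-valued entries."""
    return {k: v for k, v in attrs.items() if defaults.get(k) != v}
```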

> Some apps embed the compilers themselves and use a stored cache for faster processing:
> this approach allows easy updates by detecting changes in the XML source, and then
> downloading them.
> But in CLDR such updates are generally not automated : the general scheme evolves over time
> and there are complex dependencies to check so that some data becomes usable

Should probably read *un*usable.

> (frequently you need to implement some new algorithms to follow the processing rules
> documented in CLDR, or to use data not completely validated, or to allow applications
> to provide their overrides from insufficiently complete datasets in CLDR,
> even if CLDR provides a root locale and applications are supposed to follow the BCP 47
> fallback resolution rules;
> applications also have their own needs about which language codes they use or need,
> and CLDR provides many locales that many applications are still not prepared to render correctly,
> and many application users complain if an application is partly translated and contains
> too many fallbacks to another language, or worse, to another script).

So the case is even worse than what I could see when looking into CLDR. Many countries, 
including France, don’t care about the data of their own locale in CLDR, but I’m not going 
to vent about that on Unicode Public, because that involves language offices and authorities, 
and would have political entanglements.

Staying technical: as for the file header of UnicodeData.txt, I can see zero technical reason 
not to add one. Processes using the file to generate an overview of Unicode also use other files 
and are thus able to process comments correctly, whereas processes using UnicodeData.txt to look up 
the character properties it provides would start by searching for the code point. 
(Perhaps there are compilers building DLLs from the file.)
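To illustrate: any reader that already skips "#" comments, like the sketch below, would be entirely unaffected by a header line. (This is a simplified reader written for this post, not the parsing code of any actual implementation.)

```python
# Sketch: a UnicodeData.txt reader that ignores blank lines and '#'
# comments. The field layout (semicolon-separated, code point first)
# matches UnicodeData.txt.
def parse_unicodedata(lines):
    """Yield (code point, field list) for each data line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # a future header line would be ignored here
        fields = line.split(";")
        yield fields[0], fields
```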

On Thu, Aug 30, 2018 at 8:38 PM, Doug Ewell via Unicode wrote:

> UnicodeData.txt was devised long before any of the other UCD data files. Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file.
>
> So Unicode can't add a header to this file, and that is the reason the format can never be changed (e.g. with more columns). That is why new files keep getting created instead.
>
> The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know.
>
> That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives.
>
> Doug Ewell | Thornton, CO, US | ewellic.org


-------- Original message --------
Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
From: Marcel Schneider via Unicode
Curiously, UnicodeData.txt lacks the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because, compared
to the other UCD files, the file looks really odd without a file header showing 
at least the version number and datestamp. It’s as if the file was made for 
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that’s why at some point I submitted feedback asking 
for an extension. [...]
