UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

Fri Aug 31 01:19:53 CDT 2018

A good compromise between human readability, machine processability and
filesize would be using YAML.

Unlike JSON, YAML supports comments, anchors and references, multiple
documents in a file and several other features.

Regards,

Marius Spix

On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
wrote:

> On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> >
> > Well, an alternative to XML is JSON, which is more compact and
> > faster/simpler to process;
> 
> Thanks for pointing out the problem and the solution alike. Indeed the
> main drawback of the XML format of UCD is that it results in an
> “insane” filesize. “Insane” was applied to the number of semicolons
> in UnicodeData.txt, but that is irrelevant. What is really insane is
> the filesize of the XML versions of the UCD. Even without Unihan, it
> may take up to a minute or so to load in a text editor.
> 
> > however JSON has no explicit schema, unless the schema is being
> > made part of the data itself, complicating its structure (with many
> > levels of arrays of arrays, in which case it becomes less easy to
> > read by humans, but more adapted to automated processes for fast
> > processing).
> >
> > I'd say that the XML alone is enough to generate any JSON-derived
> > dataset that will conform to the schema an application expects to
> > process fast (and with just the data it can process, excluding
> > various extensions still not implemented). But the fastest
> > implementations are also based on data tables encoded in code (such
> > as DLL or Java classes), or custom database formats (such as
> > Berkeley DB) generated also automatically from the XML, without the
> > processing cost of decompression schemes and parsers.
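As a rough illustration of deriving an application-specific JSON dataset from the XML, here is a minimal Python sketch. The sample document is invented for illustration and leaves out the XML namespace that the real files use; the function name xml_to_json and the choice of attributes are assumptions, not part of any UCD tooling.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a flat UCD XML file. Real files are far
# larger and declare an XML namespace, which this sketch leaves out.
SAMPLE_XML = """
<ucd>
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll" sc="Latn"/>
  </repertoire>
</ucd>
"""

def xml_to_json(xml_text, wanted):
    """Return a JSON string mapping code point -> selected attributes."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter("char"):
        cp = char.get("cp")
        if cp is None:
            continue  # skip entries that do not describe a single code point
        # Keep only the attributes the application asked for.
        table[cp] = {name: char.get(name) for name in wanted if name in char.attrib}
    return json.dumps(table, sort_keys=True)

print(xml_to_json(SAMPLE_XML, ["na", "gc"]))
```

The application picks the attributes it can process and drops the rest, so the generated JSON stays small even though the XML source carries everything.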
> >
> > Still today, even if XML is not the usual format used by
> > applications, it is still the most interoperable format that allows
> > building all sorts of applications in all sorts of languages: the
> > cost of parsing is left to an application builder/compiler.
> 
> I’ve tried an online tool to get ucd.nounihan.flat.xml converted to
> CSV. The tool is great and offers a lot of options, but given the
> “insane” file size, my browser was up for over two hours of trouble
> until I shut down the computer manually. From what I could see in the
> result field, there are many bogus values, meaning that their
> presence is useless in the tags of most characters. And while many
> attributes have cryptic names in order to keep the file size minimal,
> some attributes have overlong values, i.e. the design is inconsistent.
> E.g. in every character we read: jg="No_Joining_Group". That is bogus.
> One would need to take them off the tags of most characters, and even
> in the characters where they are relevant, the value would be simply
> "No". What’s the use of abbreviating "Joining Group" to "jg" in the
> attribute name if in the value it is written out? And I’m quoting from
> U+0000. Further, many values are set to a crosshatch, instead of
> simply being removed from the characters where they are empty. Then
> the many instances of "undetermined script" resulting in *two*
> attributes with "Zyyy" value. Then in almost each character we’re told
> that it is not a whitespace, not a dash, not a hyphen, and not a
> quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N". One couldn’t
> tell that UCD does actually benefit from the flexibility of XML,
> given that many attributes are systematically present even where they
> are useless. Perhaps ucd-*.xml would be two thirds, half, or one
> third their actual size if they were properly designed.
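To make the idea of omitting default-valued attributes concrete, here is a minimal Python sketch. The table of defaults contains only the values quoted above and is an assumption for illustration; a real tool would need a complete, verified list, and the helper name strip_defaults is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed defaults, limited to the values quoted in the message above.
DEFAULTS = {
    "jg": "No_Joining_Group",
    "Dash": "N",
    "WSpace": "N",
    "Hyphen": "N",
    "QMark": "N",
}

def strip_defaults(elem, defaults=DEFAULTS):
    """Remove, in place, every attribute whose value equals its assumed default."""
    for name, default in defaults.items():
        if elem.get(name) == default:
            del elem.attrib[name]
    return elem

# Invented sample element: only WSpace differs from the assumed defaults.
sample = ET.fromstring(
    '<char cp="0020" jg="No_Joining_Group" Dash="N" WSpace="Y" Hyphen="N" QMark="N"/>'
)
strip_defaults(sample)
print(ET.tostring(sample, encoding="unicode"))
```

The trade-off is that every consumer then has to know the same defaults in order to restore the omitted values, which is presumably one reason to write them out in full.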
> 
> > Some apps embed the compilers themselves and use a stored cache for
> > faster processing: this approach allows easy updates by detecting
> > changes in the XML source, and then downloading them.
> >
> > But in CLDR such updates are generally not automated: the general
> > scheme evolves over time and there are complex dependencies to
> > check so that some data becomes usable
> 
> Should probably read *un*usable.
> 
> > (frequently you need to implement some new algorithms to follow the
> > processing rules documented in CLDR, or to use data not completely
> > validated, or to allow applications to provide their overrides from
> > insufficiently complete datasets in CLDR, even if CLDR provides a
> > root locale and applications are supposed to follow the BCP47
> > fallback resolution rules; applications also have their own need
> > about which language codes they use or need, and CLDR provides many
> > locales that many applications are still not prepared to render
> > correctly, and many application users complain if an application is
> > partly translated and contains too many fallbacks to another
> > language, or worse to another script).
> 
> So the case is even worse than what I could see when looking into
> CLDR. Many countries, including France, don’t care about the data of
> their own locale in CLDR, but I’m not going to vent about that on
> Unicode Public, because that involves language offices and
> authorities, and would have political entanglements.
> 
> Staying technical, I can tell so far about the file header of
> UnicodeData.txt that I can see zero technical reasons not to add it.
> Processes using the file to generate an overview of Unicode also use
> other files and are thus able to process comments correctly, whereas
> those processes using UnicodeData to look up character properties
> provided in the file would start by searching for the code point. (Perhaps
> there are compilers building DLLs from the file.)
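For illustration, a parser that tolerates comment lines could look like the following Python sketch. The header line in the sample is hypothetical (UnicodeData.txt has no such header); the two data lines follow the file's semicolon-delimited layout, and the function name is an assumption.

```python
def parse_unicode_data(lines):
    """Yield (code_point, name, general_category), skipping '#' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment: nothing to parse
        fields = line.rstrip("\n").split(";")
        # Field 0 is the code point in hex, field 1 the name,
        # field 2 the General_Category.
        yield int(fields[0], 16), fields[1], fields[2]

SAMPLE = [
    "# UnicodeData-11.0.0.txt (hypothetical header line)",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

for code_point, name, category in parse_unicode_data(SAMPLE):
    print(f"U+{code_point:04X} {category} {name}")
```

A parser without the comment check would pass the header line straight to int(..., 16) and raise ValueError, which is the kind of breakage in long-standing tools that is at issue here.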
> 
> On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote:
> >
> 
> 
> UnicodeData.txt was devised long before any of the other UCD data
> files. Though it might seem like a simple enhancement to us, adding a
> header block, or even a single line, would break a lot of existing
> processes that were built long ago to parse this file.
> 
> >
> So Unicode can't add a header to this file, and that is the reason
> the format can never be changed (e.g. with more columns). That is why
> new files keep getting created instead.
> 
> >
> The XML format could indeed be expanded with more attributes and more
> subsections. Any process that can parse XML can handle unknown stuff
> like this without misinterpreting the stuff it does know.
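A small Python sketch of that tolerance, under stated assumptions: the attribute futureProp and the child element extra are invented stand-ins for a later extension, and read_known is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def read_known(xml_text, known=("cp", "gc")):
    """Read only the attributes this consumer knows about; ignore the rest."""
    elem = ET.fromstring(xml_text)
    return {name: elem.get(name) for name in known}

# The same character as it might look before and after a format extension.
old = '<char cp="0041" gc="Lu"/>'
new = '<char cp="0041" gc="Lu" futureProp="x"><extra/></char>'

print(read_known(old))
print(read_known(new))
```

Both calls return the same dictionary, because the consumer addresses attributes by name rather than by position, so additions it has never heard of do not shift anything it relies on.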
> 
> >
> That's why the only two reasonable options for getting UCD data are
> to read all the tab- and semicolon-delimited files, and be ready for
> new files, or just read the XML. Asking for changes to existing UCD
> file formats is kind of a non-starter, given these two alternatives.
> 
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 
> 
> >
> 
> -------- Original message --------
> Message: 3
> 
> Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
> > From: Marcel Schneider via Unicode 
> > 
> >
> Curiously, UnicodeData.txt is lacking the header line. That makes it
> inflexible. I never wondered why the header line is missing, probably
> because compared to the other UCD files, the file looks really odd
> without a file header showing at least the version number and
> datestamp. It’s like the file was made up for dumb parsers unable to
> handle comment delimiters, and never to be upgraded to do so.
> 
> But I like the format, and that’s why at some point I submitted
> feedback asking for an extension. [...]
> 
