UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

Fri Aug 31 03:36:45 CDT 2018

To handle the UCD XML file, a streaming parser like Expat is necessary.

For codepoints.net I use that data to stuff everything in a MySQL
database. If anyone is interested, the code for that is Open Source:

https://github.com/Codepoints/unicode2mysql/

The example for handling the large XML file can be found here:

https://github.com/Codepoints/unicode2mysql/blob/master/bin/ucd_to_sql.py
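
As a rough illustration of the streaming approach (not the code from the
repository above), here is a minimal sketch using Python's built-in Expat
binding. It assumes the flat UCD XML layout, where each code point is a
<char> element carrying attributes such as cp, na and gc; the function
name stream_chars and the callback are hypothetical names chosen for this
example.

```python
import xml.parsers.expat


def stream_chars(stream, handle_char, wanted=("cp", "na", "gc")):
    """Call handle_char() once per single-code-point <char> element.

    `stream` must be a binary file object. The document is parsed
    incrementally, so the whole tree is never held in memory.
    """

    def start_element(name, attrs):
        # Namespace processing is off, so Expat reports the plain
        # element name even though the file declares a default xmlns.
        # Range entries (first-cp/last-cp instead of cp) are skipped
        # here for brevity.
        if name == "char" and "cp" in attrs:
            handle_char({key: attrs.get(key) for key in wanted})

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.ParseFile(stream)
```

For a real file one would pass something like
open("ucd.nounihan.flat.xml", "rb") as the stream and let the callback
write each row to the database.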

For me it's currently much easier to have all the data in a single
place, e.g. a large XML file, than spread over a multitude of files
_with different ad-hoc syntaxes_.

The situation might be different, though, if the UCD data were split
into several files of the same format. (Be it JSON, CSV, YAML, XML,
TOML, whatever. Just be consistent.)

Nota bene: That is also true for the emoji data, which currently
consists of five plain text files with similar but not identical formats.
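
To make the "ad-hoc syntaxes" point concrete: many of the plain-text UCD
files use semicolon-separated fields with "#" comments, but the meaning
and number of fields differs per file. The sketch below is only the
shared first step, under the assumption that no data field contains a
literal "#"; the name parse_ucd_lines is made up for this example, and
every file would still need its own interpretation of the fields.

```python
def parse_ucd_lines(lines):
    """Yield a list of stripped fields for each data line.

    Handles files with comment headers and trailing comments as well
    as files that have neither (such as UnicodeData.txt).
    """
    for line in lines:
        # Drop everything from '#' onward, then skip empty lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        yield [field.strip() for field in data.split(";")]
```

A parser written this way does not care whether a file starts with a
comment header or not; the per-file work only begins once the fields
have to be given a meaning.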

Cheers,
Manuel
On Fri, 31 Aug 2018 at 08:19, Marius Spix via Unicode
<unicode at unicode.org> wrote:
>
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
>
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.
>
> Regards,
>
> Marius Spix
>
>
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
>
> > On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> > >
> > > Well, an alternative to XML is JSON, which is more compact and
> > > faster/simpler to process;
> >
> > Thanks for pointing out the problem and the solution alike. Indeed the
> > main drawback of the XML format of UCD is that it results in an
> > “insane” filesize. “Insane” was applied to the number of semicolons
> > in UnicodeData.txt, but that is irrelevant. What is really insane is
> > the filesize of the XML versions of the UCD. Even without Unihan, it
> > may take up to a minute or so to load in a text editor.
> >
> > > however JSON has no explicit schema, unless the schema is being
> > > made part of the data itself, complicating its structure (with many
> > > levels of arrays of arrays, in which case it becomes less easy to
> > > read by humans, but more adapted to automated processes for fast
> > > processing).
> > >
> > > I'd say that the XML alone is enough to generate any JSON-derived
> > > dataset that will conform to the schema an application expects to
> > > process fast (and with just the data it can process, excluding
> > > various extensions still not implemented). But the fastest
> > > implementations are also based on data tables encoded in code (such
> > > as DLL or Java classes), or custom database formats (such as
> > > Berkeley dB) generated also automatically from the XML, without the
> > > processing cost of decompression schemes and parsers.
> > >
> > > Still today, even if XML is not the usual format used by
> > > applications, it is still the most interoperable format that allows
> > > building all sorts of applications in all sorts of languages: the
> > > cost of parsing is left to an application builder/compiler.
> >
> > I’ve tried an online tool to get ucd.nounihan.flat.xml converted to
> > CSV. The tool is great and offers a lot of options, but given the
> > “insane” file size, my browser was up for over two hours of trouble
> > until I shut down the computer manually. From what I could see in the
> > result field, there are many bogus values, meaning that their
> > presence is useless in the tags of most characters. And while many
> > attributes have cryptic names in order to keep the file size minimal,
> > some attributes have overlong values, i.e. the design is inconsistent.
> > E.g. in every character we read: jg="No_Joining_Group". That is bogus.
> > One would need to take them off the tags of most characters, and even
> > in the characters where they are relevant, the value would be simply
> > "No". What’s the use of abbreviating "Joining Group" to "jg" in the
> > attribute name if in the value it is written out? And I’m quoting from
> > U+0000. Further many values are set to a crosshatch, instead of
> > simply being removed from the characters where they are empty. Then
> > the many instances of "undetermined script" resulting in *two*
> > attributes with "Zyyy" value. Then in almost each character we’re told
> > that it is not a whitespace, not a dash, not a hyphen, and not a
> > quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t
> > tell that UCD does actually benefit from the flexibility of XML,
> > given that many attributes are systematically present even where they
> > are useless. Perhaps ucd-*.xml would be two thirds, half, or one
> > third their actual size if they were properly designed.
> >
> > > Some apps embed the compilers themselves and use a stored cache for
> > > faster processing: this approach allows easy updates by detecting
> > > changes in the XML source, and then downloading them.
> > >
> > > But in CLDR such updates are generally not automated: the general
> > > scheme evolves over time and there are complex dependencies to
> > > check so that some data becomes usable
> >
> > Should probably read *un*usable.
> >
> > > (frequently you need to implement some new algorithms to follow the
> > > processing rules documented in CLDR, or to use data not completely
> > > validated, or to allow applications to provide their overrides from
> > > insufficiently complete datasets in CLDR, even if CLDR provides a
> > > root locale and applications are supposed to follow the BCP47
> > > fallback resolution rules; applications also have their own need
> > > about which language codes they use or need, and CLDR provides many
> > > locales that many applications are still not prepared to render
> > > correctly, and many application users complain if an application is
> > > partly translated and contains too many fallbacks to another
> > > language, or worse to another script).
> >
> > So the case is even worse than what I could see when looking into
> > CLDR. Many countries, including France, don’t care about the data of
> > their own locale in CLDR, but I’m not going to vent about that on
> > Unicode Public, because that involves language offices and
> > authorities, and would have political entanglements.
> >
> > Staying technical, I can tell so far about the file header of
> > UnicodeData.txt that I can see zero technical reasons not to add it.
> > Processes using the file to generate an overview of Unicode also use
> > other files and are thus able to process comments correctly, whereas
> > those processes using UnicodeData to look up character properties
> > provided in the file would start searching the code point. (Perhaps
> > there are compilers building DLLs from the file.)
> >
> > On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote:
> > >
> >
> >
> > UnicodeData.txt was devised long before any of the other UCD data
> > files. Though it might seem like a simple enhancement to us, adding a
> > header block, or even a single line, would break a lot of existing
> > processes that were built long ago to parse this file.
> >
> > >
> > So Unicode can't add a header to this file, and that is the reason
> > the format can never be changed (e.g. with more columns). That is why
> > new files keep getting created instead.
> >
> > >
> > The XML format could indeed be expanded with more attributes and more
> > subsections. Any process that can parse XML can handle unknown stuff
> > like this without misinterpreting the stuff it does know.
> >
> > >
> > That's why the only two reasonable options for getting UCD data are
> > to read all the tab- and semicolon-delimited files, and be ready for
> > new files, or just read the XML. Asking for changes to existing UCD
> > file formats is kind of a non-starter, given these two alternatives.
> >
> > --
> > Doug Ewell | Thornton, CO, US | ewellic.org
> >
> >
> > >
> >
> > -------- Original message --------
> > Message: 3
> >
> > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
> > > From: Marcel Schneider via Unicode
> > >
> > >
> > Curiously, UnicodeData.txt is lacking the header line. That makes it
> > inflexible. I never wondered why the header line is missing, probably
> > because compared to the other UCD files, the file looks really odd
> > without a file header showing at least the version number and
> > datestamp. It’s like the file was made up for dumb parsers unable to
> > handle comment delimiters, and never to be upgraded to do so.
> >
> > But I like the format, and that’s why at some point I submitted
> > feedback asking for an extension. [...]