Unicode Digest, Vol 56, Issue 20

Philippe Verdy via Unicode unicode at unicode.org
Thu Aug 30 16:26:36 CDT 2018

Well, an alternative to XML is JSON, which is more compact and
faster and simpler to process. However, JSON has no explicit schema unless the
schema is made part of the data itself, which complicates its structure
(with many levels of arrays of arrays, in which case it becomes harder
for humans to read but better suited to fast automated processing).

I'd say that the XML alone is enough to generate any JSON-derived dataset
conforming to the schema an application expects to process quickly (with
just the data it can handle, excluding extensions it has not yet
implemented). But the fastest implementations are also based on data tables
encoded in code (such as DLLs or Java classes) or on custom database formats
(such as Berkeley DB), also generated automatically from the XML, without
the processing cost of decompression schemes and parsers.
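
As a rough illustration of that generation step, here is a minimal Python
sketch. The element and attribute names (char, cp, na, gc) are modeled on
the UCD XML, but the inline sample, the "future-attr" attribute, the
"new-section" element and the xml_to_json helper are all made up for this
example; a real build step would read the published file and handle its
namespace.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely modeled on the UCD XML (not the real file).
SAMPLE_XML = """\
<ucd>
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" future-attr="ignored"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
    <new-section><item/></new-section>
  </repertoire>
</ucd>
"""

# The only properties this imaginary application knows how to process.
WANTED = ("na", "gc")


def xml_to_json(xml_text, wanted=WANTED):
    """Build a compact JSON table keyed by code point from <char> elements.

    Attributes and elements the application does not know about are
    simply skipped, so additions to the XML do not break this step.
    """
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter("char"):
        props = {k: char.get(k) for k in wanted if char.get(k) is not None}
        table[char.get("cp")] = props
    # Compact separators keep the generated dataset small.
    return json.dumps(table, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

The same pass could just as well emit a source file or a database instead
of JSON; the point is only that the XML stays the single source.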

Even today, although XML is not the format applications usually consume
directly, it is still the most interoperable format, one that allows
building all sorts of applications in all sorts of languages: the cost of
parsing is left to an application builder or compiler. Some apps embed the
compilers themselves and use a stored cache for faster processing. This
approach allows easy updates by detecting changes in the XML source and
then downloading them.

But in CLDR such updates are generally not automated. The general scheme
evolves over time, and there are complex dependencies to check before some
data becomes usable. Frequently you need to implement new algorithms to
follow the processing rules documented in CLDR, or to use data that is not
completely validated, or to allow applications to provide their own
overrides for insufficiently complete datasets in CLDR, even though CLDR
provides a root locale and applications are supposed to follow the BCP47
fallback resolution rules. Applications also have their own needs regarding
which language codes they use, and CLDR provides many locales that many
applications are still not prepared to render correctly. Many application
users also complain if an application is only partly translated and
contains too many fallbacks to another language, or worse, to another [...]
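
To make the fallback idea concrete, here is a small Python sketch of
truncation-based lookup in the style of RFC 4647 (part of BCP47), ending at
a root locale. It is only an approximation: the inheritance actually used
by CLDR involves more than plain truncation, and the names fallback_chain,
resolve and the sample bundles are hypothetical.

```python
def fallback_chain(tag, root="root"):
    """Return the lookup order for a language tag, most specific first."""
    parts = tag.replace("_", "-").split("-")
    chain = []
    while parts:
        chain.append("-".join(parts))
        parts.pop()
        # RFC 4647 lookup: a single-character subtag (such as "x" or "u")
        # is removed together with the subtag that followed it.
        while parts and len(parts[-1]) == 1:
            parts.pop()
    chain.append(root)
    return chain


def resolve(key, tag, bundles):
    """Find `key` in the first locale of the chain that defines it.

    Returns (value, locale_used) so that the caller can notice how often
    it ended up falling back to another locale.
    """
    for locale in fallback_chain(tag):
        value = bundles.get(locale, {}).get(key)
        if value is not None:
            return value, locale
    raise KeyError(key)


if __name__ == "__main__":
    # Made-up data for illustration only.
    bundles = {"root": {"yes": "yes", "no": "no"}, "fr": {"yes": "oui"}}
    print(fallback_chain("zh-Hant-TW"))
    print(resolve("yes", "fr-CA", bundles))
    print(resolve("no", "fr-CA", bundles))
```

Returning the locale that supplied the value is a deliberate choice here:
an application can count how many strings came from the root or from
another language and decide whether the locale is complete enough to offer.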

On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> UnicodeData.txt was devised long before any of the other UCD data files.
> Though it might seem like a simple enhancement to us, adding a header
> block, or even a single line, would break a lot of existing processes that
> were built long ago to parse this file.
> So Unicode can't add a header to this file, and that is the reason the
> format can never be changed (e.g. with more columns). That is why new files
> keep getting created instead.
> The XML format could indeed be expanded with more attributes and more
> subsections. Any process that can parse XML can handle unknown stuff like
> this without misinterpreting the stuff it does know.
> That's why the only two reasonable options for getting UCD data are to
> read all the tab- and semicolon-delimited files, and be ready for new
> files, or just read the XML. Asking for changes to existing UCD file
> formats is kind of a non-starter, given these two alternatives.
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> -------- Original message --------
> Message: 3
> Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
> From: Marcel Schneider via Unicode <unicode at unicode.org>
> Curiously, UnicodeData.txt is lacking the header line. That makes it
> inflexible.
> I never wondered why the header line is missing, probably because compared
> to the other UCD files, the file looks really odd without a file header
> showing at least the version number and datestamp. It's like the file was
> made up for dumb parsers unable to handle comment delimiters, and never to
> be upgraded to do so.
> But I like the format, and that's why at some point I submitted feedback
> asking for an extension. [...]
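
For readers who have not seen such a parser, the quoted point about
UnicodeData.txt can be sketched in Python as follows. The field labels are
informal names chosen for this example (the file itself carries none), and
the strict 15-field check is an assumption about how a long-established
parser might have been written, not a description of any specific tool.

```python
# Informal labels for the 15 semicolon-separated fields of UnicodeData.txt.
FIELDS = (
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "bidi_mirrored",
    "unicode_1_name", "iso_comment", "upper", "lower", "title",
)


def parse_line(line):
    """Parse one UnicodeData.txt record into a dict.

    Like many old parsers, this one assumes every line is a record:
    no comments, no header, exactly 15 fields.
    """
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


if __name__ == "__main__":
    record = parse_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
    print(record["name"], record["general_category"], record["lower"])
    try:
        # A header or comment line is enough to make this parser fail.
        parse_line("# hypothetical header line")
    except ValueError as err:
        print("rejected:", err)
```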