UCD in XML or in CSV? (is: UCD data consumption)

Philippe Verdy via Unicode unicode at unicode.org
Mon Sep 3 14:40:01 CDT 2018

But CSV is only fine for purely tabular data, and the UCD or CLDR data has
a more complex structure than a simple 2D table. In addition, the schema is
evolving, with new kinds of data added all the time; you cannot keep
compatibility by adding more empty columns to a single table. Adding new
semicolons or other separators to a CSV file makes the format much less
readable, and the file will then contain a lot of redundancy.
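
To illustrate the point about structure, here is a minimal Python sketch
(my own illustration, not an official tool). It takes lines in the
UnicodeData.txt layout and shows that the decomposition column already
packs an optional <tag> plus a list of code points into one
semicolon-delimited field, which a nested format can represent directly.
The function name to_record and the key names of the output record are
assumptions chosen for this example.

```python
import json

def to_record(line):
    """Convert one UnicodeData.txt-style line into a nested record.

    Field 5 (0-based) is the decomposition: an optional <tag> followed by
    space-separated code points. No tag means a canonical decomposition.
    """
    fields = line.rstrip("\n").split(";")
    parts = fields[5].split()
    decomposition = None
    if parts:
        tag = "canonical"
        if parts[0].startswith("<"):
            tag = parts.pop(0).strip("<>")
        decomposition = {"type": tag, "mapping": parts}
    return {"cp": fields[0], "name": fields[1], "decomposition": decomposition}

if __name__ == "__main__":
    line = "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;"
    print(json.dumps(to_record(line), indent=2))
```

A flat table would need either a second level of ad hoc parsing inside
that column, as above, or a separate table keyed by code point.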

Like traditional relational databases, these projects need a schema and a
structure. But if we had to use an RDBMS API, we would lose the possibility
of using various tools. So these Unicode databases use collections of
tables, and in some cases you need to split a value into multiple ones
with different scoping rules: for that job JSON or XML is fine. Nothing
prevents you from loading the existing UCD/CLDR database files into a
relational database and exposing the data in different views. But most
applications are in fact built by first loading this data with a parser
specific to the application, which converts it to an application-defined
schema; the data can then be recompiled in a new form that is exposed by
an application API.
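
As a rough sketch of the "load it into a relational database and expose
views" route, the following Python uses the standard sqlite3 module with
three sample lines in the UnicodeData.txt layout. The table name chars,
the view name uppercase_letters, and the function load_ucd are
hypothetical names picked for this example; a real import would read the
full file and keep all fields.

```python
import sqlite3

# Three sample lines in the UnicodeData.txt layout (15 ;-separated fields).
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def load_ucd(text):
    """Load a few columns into an in-memory SQLite table and define a view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chars (cp INTEGER PRIMARY KEY, name TEXT, gc TEXT, lower INTEGER)"
    )
    for line in text.splitlines():
        if not line.strip():
            continue
        f = line.split(";")
        # Field 13 (0-based) is the simple lowercase mapping; it may be empty.
        lower = int(f[13], 16) if f[13] else None
        conn.execute(
            "INSERT INTO chars VALUES (?, ?, ?, ?)",
            (int(f[0], 16), f[1], f[2], lower),
        )
    # One possible "view" of the data: uppercase letters with their mapping.
    conn.execute(
        "CREATE VIEW uppercase_letters AS "
        "SELECT cp, name, lower FROM chars WHERE gc = 'Lu'"
    )
    return conn

if __name__ == "__main__":
    conn = load_ucd(SAMPLE)
    for cp, name, lower in conn.execute("SELECT * FROM uppercase_letters"):
        print(f"U+{cp:04X} {name} -> U+{lower:04X}")
```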

XML is then fine! It has no cost for final users who just use the
generated applications. It is only up to application compiler projects to
parse the data, generate their code, and integrate the data into their API
(there are more useful tools than just grepping the UCD/CLDR data files).
Also, the UCD and CLDR files are checked by other automated tools that
already parse them and load them to perform consistency checks and
generate multiple presentations: the important ICU project is built and
maintained for that; it has all the tools needed, plus a reduced API that
can be used directly by final applications. Some UCD files are even
automatically generated from other source files, and they contain
automatically generated reports. Only the initial main UCD file has kept
its original pure CSV form: it was no longer possible to keep extending
this single file, but compatibility has been preserved, and that is a good
thing. All the others contain comment lines and basic report lines.
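
For those other files, a plain field split is no longer enough, because
of the "#" comments and the code point ranges. A small Python sketch of
what an application-specific parser might do with lines in the style of
Scripts.txt follows; the function name parse_ucd_lines and the tuple
layout it yields are assumptions made for this example.

```python
# Sample lines in the style of Scripts.txt: data, then a "#" comment.
SAMPLE = """\
# This comment line and the blank line below are skipped.

0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""

def parse_ucd_lines(text):
    """Yield (start, end, properties) for each data line.

    Everything after "#" is dropped, fields are split on ";", and the
    first field is either a single code point or a "start..end" range.
    """
    for raw in text.splitlines():
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end, fields[1:]

if __name__ == "__main__":
    for start, end, props in parse_ucd_lines(SAMPLE):
        print(f"U+{start:04X}..U+{end:04X} {props}")
```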

On Mon, 3 Sep 2018 at 12:16, Adam Borowski via Unicode <unicode at unicode.org>
wrote:

> On Mon, Sep 03, 2018 at 08:24:06AM +0200, Janusz S. Bień via Unicode wrote:
> > For a non-programmer like me CSV is a much more convenient form than XML -
> > I can use it not only with a spreadsheet, but also import directly into
> > a database and analyse with various queries. XML is politically correct,
> > but practically almost unusable without a specialised parser.
> And for a programmer, XML is outright insane.  You need a complex library
> to do so, and those fail KISS so badly that you have a CVE roughly yearly.
> On the other hand, writing a parser for current headerless ;-separated data
> completely from scratch is just:
> cut -d';' -f 1,6 </usr/share/unicode/UnicodeData.txt
> or:
> (split/;/)[0,5]
> JSON is somewhat better, but still needs drastically more effort.
> CSV (especially with no escapes) is trivial to handle.
> ᛗᛖᛟᚹ!
> --
> ⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
> ⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
> ⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
> ⠈⠳⣄⠀⠀⠀⠀ • use glitches to walk on water [Mt14:25-26]