UCD in XML or in CSV?

Fri Aug 31 12:50:08 CDT 2018

On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:
> For codepoints.net I use that data to stuff everything in a MySQL
> database.

Well, for some sense of "everything", anyway. ;-)

People having this discussion should keep in mind a few significant points.

First, the UCD proper isn't "everything", extensive as it is. There are 
also other significant sets of data that the UTC maintains about 
characters in other formats, as well, including the data files 
associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, 
etc.), UTS #10 (collation), UTR #25 (a set of math-related property 
values), and UTS #51 (emoji-related). The emoji-related data has now 
strayed into the CLDR space, so a significant amount of the information 
about emoji characters is now carried as CLDR tags. And then there is 
various other information about individual characters (or small sets of 
characters) scattered in the core spec -- some in tables, some not, as 
well as mappings to dozens of external standards. There is no actual 
definition anywhere of what "everything" actually is. Further, it is a 
mistake to assume that every character property just associates a simple 
attribute with a code point. There are multiple types of mappings, 
complex relational and set properties, and so forth.

The UTC attempts to keep a fairly clear line around what constitutes the 
"UCD proper" (including Unihan.zip), in part so that it is actually 
possible to run the tools that create the XML version of the UCD, for 
folks who want to consume a more consistent, single-file format version 
of the data. But be aware that that isn't everything -- nor would there 
be much sense in trying to keep expanding the UCD proper to actually 
represent "everything" in one giant DTD.

Second, one of the main obligations of a standards organization is 
*stability*. People may well object to the ad hoc nature of the UCD data 
files that have been added over the years -- but it is a *stable* 
ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
tweaking formats of data files to meet complaints about one particular 
parsing inconvenience or another. That would create multiple points of 
discontinuity between versions -- worse than just having to deal with 
the ongoing growth in the number of assigned characters and the 
occasional addition of new data files and properties to the UCD.

Keep in mind that there is more to processing the UCD than just 
"latest". People who just focus on grabbing the very latest version of 
the UCD and updating whatever application they have are missing half the 
problem. There are multiple tools out there that parse and use multiple 
*versions* of the UCD. That includes the tooling that is used to 
maintain the UCD (which parses *all* versions), and the tooling that 
creates UCD in XML, which also parses all versions. Then there is 
tooling like unibook, to produce code charts, which also has to adapt to 
multiple versions, and bidi reference code, which also reads multiple 
versions of UCD data files. Those are just examples I know off the top 
of my head. I am sure there are many other instances out there that fit 
this profile. And none of the applications already built to handle 
multiple versions would welcome having to permanently build in tracking 
particular format anomalies between specific versions of the UCD.

Third, please remember that folks who come here complaining about the 
complications of parsing the UCD are a very small percentage of a very 
small percentage of a very small percentage of interested parties. 
Nearly everybody who needs UCD data should be consuming it as a 
secondary source (e.g. for reference via codepoints.net), or as a 
tertiary source (behind specialized API's, regex, etc.), or as an end 
user (just getting behavior they expect for characters in applications). 
Programmers who actually *need* to consume the raw UCD data files and 
write parsers for them directly should actually be able to deal with the 
format complexity -- and, if anything, slowing them down to make them 
think about the reasons for the format complexity might be a good thing, 
as it tends to put the lie to the easy initial assumption that the UCD 
is nothing more than a bunch of simple attributes for all the code points.

--Ken