UCD in XML or in CSV?
Ken Whistler via Unicode
unicode at unicode.org
Fri Aug 31 12:50:08 CDT 2018
On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:
> For codepoints.net I use that data to stuff everything in a MySQL
Well, for some sense of "everything", anyway. ;-)
People having this discussion should keep in mind a few significant points.
First, the UCD proper isn't "everything", extensive as it is. The UTC
also maintains other significant sets of data about characters, in other
formats, including the data files
associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping,
etc.), UTS #10 (collation), UTR #25 (a set of math-related property
values), and UTS #51 (emoji-related). The emoji-related data has now
strayed into the CLDR space, so a significant amount of the information
about emoji characters is now carried as CLDR tags. And then there is
various other information about individual characters (or small sets of
characters) scattered in the core spec -- some in tables, some not, as
well as mappings to dozens of external standards. There is no
definition anywhere of what "everything" actually is. Further, it is a
mistake to assume that every character property just associates a simple
attribute with a code point. There are multiple types of mappings,
complex relational and set properties, and so forth.
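To make the distinction concrete, here is a minimal sketch in Python. It assumes only the standard library's unicodedata module, which exposes a small subset of the UCD; it is an illustration, not how any UTC tooling works. It contrasts a simple enumerated attribute with properties whose values are sequences of code points or tagged mappings.

```python
import unicodedata

# A simple attribute: one code point maps to one enumerated value.
simple = unicodedata.category("A")                # 'Lu'

# A mapping: the value is a sequence of code points, not a single value.
canonical = unicodedata.decomposition("\u00C5")   # '0041 030A'

# A tagged mapping: a decomposition type plus a sequence.
compat = unicodedata.decomposition("\uFB01")      # '<compat> 0066 0069'

# A full case mapping that changes the length of the string.
full_upper = "\u00DF".upper()                     # 'SS'

for label, value in [
    ("General_Category of U+0041", simple),
    ("Decomposition of U+00C5", canonical),
    ("Decomposition of U+FB01", compat),
    ("Uppercase of U+00DF", full_upper),
]:
    print(f"{label}: {value}")
```

A schema with one scalar column per property can hold the first value but not the other three without extra modeling.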
The UTC attempts to keep a fairly clear line around what constitutes the
"UCD proper" (including Unihan.zip), in part so that it is actually
possible to run the tools that create the XML version of the UCD, for
folks who want to consume a more consistent, single-file format version
of the data. But be aware that this isn't everything -- nor would there
be much sense in trying to keep expanding the UCD proper to actually
represent "everything" in one giant DTD.
Second, one of the main obligations of a standards organization is
*stability*. People may well object to the ad hoc nature of the UCD data
files that have been added over the years -- but it is a *stable*
ad-hockery. The worst thing the UTC could do, IMO, would be to keep
tweaking formats of data files to meet complaints about one particular
parsing inconvenience or another. That would create multiple points of
discontinuity between versions -- worse than just having to deal with
the ongoing growth in the number of assigned characters and the
occasional addition of new data files and properties to the UCD.
Keep in mind that there is more to processing the UCD than just
"latest". People who just focus on grabbing the very latest version of
the UCD and updating whatever application they have are missing half the
problem. There are multiple tools out there that parse and use multiple
*versions* of the UCD. That includes the tooling that is used to
maintain the UCD (which parses *all* versions), and the tooling that
creates UCD in XML, which also parses all versions. Then there is
tooling like unibook, to produce code charts, which also has to adapt to
multiple versions, and bidi reference code, which also reads multiple
versions of UCD data files. Those are just examples I know off the top
of my head. I am sure there are many other instances out there that fit
this profile. And none of the applications already built to handle
multiple versions would welcome having to permanently build in tracking
particular format anomalies between specific versions of the UCD.
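A small sketch of what "more than just latest" looks like from the consumer side, assuming Python's standard library: the unicodedata module ships the data for whatever UCD version the interpreter was built with, plus a frozen Unicode 3.2.0 database as unicodedata.ucd_3_2_0. The helper name category_by_version is made up for this example.

```python
import unicodedata

current = unicodedata            # UCD version bundled with this interpreter
legacy = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 database

def category_by_version(ch):
    """Return the General_Category of ch under each available UCD version."""
    return {
        legacy.unidata_version: legacy.category(ch),
        current.unidata_version: current.category(ch),
    }

# U+1F600 GRINNING FACE was not yet assigned in Unicode 3.2.0.
result = category_by_version("\U0001F600")
for version, category in result.items():
    print(version, category)
```

The same code point is reported as unassigned (Cn) under 3.2.0 and as a symbol (So) under the current data, so anything that has to answer for more than one version must carry more than one version of the data.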
Third, please remember that folks who come here complaining about the
complications of parsing the UCD are a very small percentage of a very
small percentage of a very small percentage of interested parties.
Nearly everybody who needs UCD data should be consuming it as a
secondary source (e.g. for reference via codepoints.net), or as a
tertiary source (behind specialized APIs, regex, etc.), or as an end
user (just getting behavior they expect for characters in applications).
Programmers who actually *need* to consume the raw UCD data files and
write parsers for them directly should be able to deal with the
format complexity -- and, if anything, slowing them down to make them
think about the reasons for the format complexity might be a good thing,
as it tends to put the lie to the easy initial assumption that the UCD
is nothing more than a bunch of simple attributes for all the code points.
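For readers who do fall into that small group, here is a minimal sketch in Python of a parser for the most common UCD line layout: semicolon-separated fields, "#" comments, and an optional "first..last" range in the first field. The function name parse_ucd_lines and the embedded SAMPLE text are illustrative assumptions. The sample lines follow the style of Scripts.txt, and a real consumer would read the actual file for a specific UCD version.

```python
from typing import Iterator, List, Tuple

# Illustrative input in the style of Scripts.txt (not a complete file).
SAMPLE = """\
# Comment lines and blank lines are skipped.

0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
"""

def parse_ucd_lines(text: str) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (start, end, fields) for each data line in text."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point
        yield start, end, fields[1:]

records = list(parse_ucd_lines(SAMPLE))
for start, end, values in records:
    print(f"{start:04X}..{end:04X} {values}")
```

This sketch deliberately covers only the easy case. It does not handle the First/Last range convention in UnicodeData.txt, files whose first field is a sequence of code points, or the @missing default-value lines carried in comments, which is exactly the kind of format complexity described above.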