UCD in XML or in CSV?

Marcel Schneider via Unicode unicode at unicode.org
Fri Aug 31 23:18:32 CDT 2018

On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> Second, one of the main obligations of a standards organization is 
> *stability*. People may well object to the ad hoc nature of the UCD data 
> files that have been added over the years -- but it is a *stable* 
> ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
> tweaking formats of data files to meet complaints about one particular 
> parsing inconvenience or another. That would create multiple points of 
> discontinuity between versions -- worse than just having to deal with 
> the ongoing growth in the number of assigned characters and the 
> occasional addition of new data files and properties to the UCD.

I did not mean to cause trouble by asking for conventions to be moved back and forth.
I would like to learn why UnicodeData.txt was first released as a draft without 
so much as a header, given that Unicode knew well in advance that the scheme 
adopted at first release would be kept stable for decades or forever. 

Then I’d like to learn how Unicode came not to devise a consistent scheme
for all the UCD files, if any such scheme could have been devised, so that people 
could assess whether complaints about inconsistencies are well founded
or not. It is not enough for me that a given ad-hockery is stable; IMO it should 
also be well designed, which is what history expects of a standards body.
That is not what is said about UnicodeData.txt, although it is the only 
file in the UCD effectively formatted for streamlined processing. Was there not 
enough time to think about a header line and a file header? With a header 
line the format would be flexible, and all the problems would be solved by 
specifying that parsers should start by counting the number of fields before 
creating their storage arrays. We lack a real history of Unicode explaining why 
everybody was in such a hurry. “Authors falling like flies” is the only hint that 
comes to mind.
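To sketch what I mean: here is a minimal, hypothetical example of such a parser. The header line and field names below are my own invention (UnicodeData.txt has no header); the point is only that a parser can count the fields first and size its storage afterwards:

```python
import csv

# Hypothetical UCD-style file: semicolon-separated, with an invented
# header line naming the fields as the first record.
sample = """codepoint;name;general_category
0041;LATIN CAPITAL LETTER A;Lu
0061;LATIN SMALL LETTER A;Ll
"""

reader = csv.reader(sample.splitlines(), delimiter=";")
header = next(reader)                       # count the fields first ...
records = [dict(zip(header, row)) for row in reader]  # ... then build storage

print(len(header))            # 3
print(records[0]["name"])     # LATIN CAPITAL LETTER A
```

With a header like this, adding a new property would not break existing parsers: they would simply pick up one more column.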

And given that Unicode appears to have missed that opportunity, I’d like to 
discuss whether it would be time to add a more polished file for better usability.

> Keep in mind that there is more to processing the UCD than just 
> "latest". People who just focus on grabbing the very latest version of 
> the UCD and updating whatever application they have are missing half the 
> problem. There are multiple tools out there that parse and use multiple 
> *versions* of the UCD. That includes the tooling that is used to 
> maintain the UCD (which parses *all* versions), and the tooling that 
> creates UCD in XML, which also parses all versions. Then there is 
> tooling like unibook, to produce code charts, which also has to adapt to 
> multiple versions, and bidi reference code, which also reads multiple 
> versions of UCD data files. Those are just examples I know off the top 
> of my head. I am sure there are many other instances out there that fit 
> this profile. And none of the applications already built to handle 
> multiple versions would welcome having to permanently build in tracking 
> particular format anomalies between specific versions of the UCD.

That point is clear to me, and even when I suggested changes to
BidiMirrored.txt, I offered alternatives that kept the existing file stable 
alongside a new, enhanced file. But what is totally unclear to me is what role 
old versions play in compiling the latest data. A delta is fine, research on a 
particular topic in old data is fine, but what does it mean to need to parse 
*all* versions in order to produce the newest deliverables?

> Third, please remember that folks who come here complaining about the 
> complications of parsing the UCD are a very small percentage of a very 
> small percentage of a very small percentage of interested parties. 
> Nearly everybody who needs UCD data should be consuming it as a 
> secondary source (e.g. for reference via codepoints.net), or as a 
> tertiary source (behind specialized API's, regex, etc.), or as an end 
> user (just getting behavior they expect for characters in applications). 
> Programmers who actually *need* to consume the raw UCD data files and 
> write parsers for them directly should actually be able to deal with the 
> format complexity -- and, if anything, slowing them down to make them 
> think about the reasons for the format complexity might be a good thing, 
> as it tends to put the lie to the easy initial assumption that the UCD 
> is nothing more than a bunch of simple attributes for all the code points.

That makes no sense to me. The raw UCD data is, and remains, a primary source;
I see no way to consume it as a secondary or a tertiary source. 
Do you mean consuming it via secondary or tertiary sources? Then we 
are actually consuming those sources instead of the raw UCD data.
Those sources are fine for the purpose of getting information about 
particular code points, but most of the tools I remember do not allow 
filtering values, computing overviews, or adding one’s own data, as we can 
in spreadsheet software. Honestly, are we so few people using Excel 
for Unicode data? Even Excel Starter, which is what I have, is a great tool
that helps me perform tasks I cannot accomplish with other tools, even other 
spreadsheet software. So I beg you to please spare me the tedious conversion 
from the clumsy UCD XML file to handy CSV. BTW, the former could be cleaned up as well.
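For what it’s worth, that conversion can be mechanized in a few lines. The sketch below is illustrative only: the fragment imitates the flat form of the UCD XML (each <char> element carrying its properties as attributes, under the published namespace), and the chosen columns are my own selection, not a complete property list:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative fragment shaped like the flat UCD XML; not the real file.
NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
xml_fragment = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
  </repertoire>
</ucd>"""

fields = ["cp", "na", "gc"]          # the attributes we want as CSV columns
out = io.StringIO()
writer = csv.writer(out, delimiter=";")
writer.writerow(fields)
for char in ET.fromstring(xml_fragment).iter(NS + "char"):
    writer.writerow(char.get(f, "") for f in fields)

print(out.getvalue())
```

Missing attributes simply become empty cells, so the same loop works whether or not every <char> carries every property.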

As already said, if this discussion has a positive outcome, a request in 
due form may follow. But I have no time to work out any further papers 
if there is no point.


