UCD in XML or in CSV? (is: UCD data consumption)

Marcel Schneider via Unicode unicode at unicode.org
Sat Sep 1 21:16:07 CDT 2018


I’m not responding without thinking, as I was blamed for doing once,
but it is painful for me to dig into what Ken explained about how
we should be consuming UCD data. I’ll now try to bring some more clarity
to the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> […]
> > 
> > Third, please remember that folks who come here complaining about the 
> > complications of parsing the UCD are a very small percentage of a very 
> > small percentage of a very small percentage of interested parties. 

OK, among an average of about 700 list subscribers, relatively few ever
complain about anything, let alone about this particular topic. But we should
always keep in mind that many folks out there who complain about Unicode
don’t come here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a 
> > secondary source (e.g. for reference via codepoints.net), or as a 
> > tertiary source (behind specialized API's, regex, etc.),

As already suggested, “as” should probably read “via” in that part.

> > or as an end 
> > user (just getting behavior they expect for characters in applications). 

That is more than a simple statement about who is consuming UCD data in which
way, since you say “should.” There seem to be assumptions that diving into the
raw data is discouraged; that folks reading file headers are not doing well;
that the data should be assembled only in certain ways; and that ignorant
people shouldn’t open the UCD cupboard to pick a file they deem useful.

If so, then it might be surprising to learn that when submitting a proposal,
“Bidi-mirroring mathematical symbols issues feedback”
http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html
I had started out as a quasi-end-user not getting the behavior I expected for
characters in browsers. I was looking for characters that are bidi-mirrored by
glyph exchange, as implemented in web browsers, because I wanted end users to
be able to experience bidi-mirroring as it works. Unexpectedly, a number of
math symbols did not mirror, despite many of them even being scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and 
> > write parsers for them directly should actually be able to deal with the 
> > format complexity -- and, if anything, slowing them down to make them 
> > think about the reasons for the format complexity might be a good thing, 

I can see one main reason for the format complexity: data from the various
properties don’t necessarily telescope the same way to make for small files.
The complexity of UCD would then be mainly self-induced by the practice of
packing data into one small file per property, rather than adding the value
to each relevant code point in one large list, as UnicodeData.txt does.
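To make the "telescoping" concrete, here is a minimal Python sketch of how a
range-compressed property file could be unfolded into one value per code
point. It assumes the common UCD line layout "code point or range ; value
# comment"; the sample lines are only modeled on Scripts.txt, and the names
parse_property_lines and expand are hypothetical helpers, not part of any
Unicode tool.

```python
import re

# Illustrative sample modeled on Scripts.txt (an assumption, not a verbatim
# excerpt of any particular UCD version).
SAMPLE = """\
# Comment lines and blank lines are skipped.

0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
"""

# First field: a single code point, or "first..last" for a range.
CP_FIELD = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?$")


def parse_property_lines(text):
    """Yield (first, last, value) for every data line in the text."""
    for raw in text.splitlines():
        data = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        match = CP_FIELD.match(fields[0])
        if match is None or len(fields) < 2:
            raise ValueError(f"unexpected line: {raw!r}")
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        yield first, last, fields[1]


def expand(ranges):
    """Unfold ranges into a dict with one entry per code point."""
    table = {}
    for first, last, value in ranges:
        for cp in range(first, last + 1):
            table[cp] = value
    return table


table = expand(parse_property_lines(SAMPLE))
print(len(table), table[0x0041], table[0x0372])
```

The range handling is the part that differs from reading UnicodeData.txt line
by line; everything else is plain field splitting.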

While I’m now taking the time to write this up because I’m committed to
processing that information, we can think of many, many people who don’t like
to be slowed down trying to find out why Unicode changed the UCD design, when
following the original idea of a large CSV list would have been
straightforward, possibly by setting up a new list if the first one got stuck.
What I can figure out is that each time a new property was added, that
particular property was thought of as being the last one.
(At some point the many files were then dumped into the known XML files.)

If UCD is to be made of small files, it is necessarily complex, and the
conclusion is that there should be another large CSV grid to make things
as simple and lightweight again as they can be.
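As a rough illustration of such a grid, the following Python sketch merges a
few per-property tables into one semicolon-separated list with one row per
scalar value. The three small dictionaries stand in for data already read
from the individual files; the column names, the "cp" header and the
write_grid helper are assumptions made for this example only.

```python
import csv
import io

# Hypothetical per-property tables keyed by scalar value; in practice each
# would be filled from its own UCD file.
general_category = {0x0028: "Ps", 0x0029: "Pe", 0x0041: "Lu"}
script = {0x0028: "Common", 0x0029: "Common", 0x0041: "Latin"}
bidi_mirrored = {0x0028: "Y", 0x0029: "Y", 0x0041: "N"}


def write_grid(columns, out, default=""):
    """Write one row per code point, one column per property."""
    names = list(columns)
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["cp"] + names)
    # Every code point that occurs in at least one property table.
    all_cps = sorted(set().union(*columns.values()))
    for cp in all_cps:
        row = [f"{cp:04X}"]
        row += [columns[name].get(cp, default) for name in names]
        writer.writerow(row)


buf = io.StringIO()
write_grid(
    {"gc": general_category, "sc": script, "Bidi_M": bidi_mirrored},
    buf,
)
grid = buf.getvalue()
print(grid)
```

Adding a property then means adding a column, which a spreadsheet or any CSV
reader can pick up without a new parser.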

> > as it tends to put the lie to the easy initial assumption that the UCD 
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with “simple” taken out? It does not appear to me
to be a lie then. One attribute comes to mind that is so complex that its
design even changed over time, despite Unicode’s commitment to stability.
The Bidi_Mirroring_Glyph property was originally designed to include “best-fit”
pairs for least-bad display in applications not supporting RTL glyphs
(i.e. without OpenType support), with legibility of math formulae in mind.
Later (probably due to a poorly written OpenType spec), no more best-fit pairs
were added to BidiMirroring.txt, as if OpenType implementers would not remove
the best-fit pairs anyway before using the file (although the spec says to use
it as-is). That then led to the display problem pointed out above.
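For what removing the best-fit pairs before use could look like, here is a
short Python sketch. It assumes the BidiMirroring.txt layout "source; target
# comment" and that best-fit pairs are flagged with "[BEST FIT]" in the
comment; the sample lines are illustrative rather than copied from a specific
version of the file, and parse_mirroring is a hypothetical helper.

```python
# Illustrative sample in the assumed BidiMirroring.txt layout.
SAMPLE = """\
0028; 0029 # LEFT PARENTHESIS
0029; 0028 # RIGHT PARENTHESIS
22A6; 2ADE # [BEST FIT] ASSERTION
2ADE; 22A6 # [BEST FIT] SHORT LEFT TACK
"""


def parse_mirroring(text):
    """Return a list of (source, target, is_best_fit) tuples."""
    pairs = []
    for raw in text.splitlines():
        data, _, comment = raw.partition("#")
        data = data.strip()
        if not data:
            continue  # comment-only or blank line
        source, target = (int(field.strip(), 16) for field in data.split(";"))
        pairs.append((source, target, "[BEST FIT]" in comment))
    return pairs


pairs = parse_mirroring(SAMPLE)

# Mapping for a renderer that wants exact mirror pairs only.
exact_only = {s: t for s, t, best_fit in pairs if not best_fit}

# Mapping for a fallback display that accepts best-fit pairs as well.
everything = {s: t for s, t, _ in pairs}

print(len(exact_only), len(everything))
```

The filter is a single condition, so keeping best-fit pairs in the file does
not have to cost implementers much.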

I’ll spare you the particular problem related to three pairs of symbols with
tilde, as well as the missing Bidi_Mirroring_Type property, given that UTC was
not interested.

So you can understand that I’m not unaware of the complexity of UCD. Still,
I don’t think that this can be an argument for not publishing a medium-size
CSV file with scalar values listed as in UnicodeData.txt.

> 
> […]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

I.e. not every spreadsheet program seems to do the job the way I need it.

Regards,

Marcel


