The Unicode Standard and ISO
Richard Wordingham via Unicode
unicode at unicode.org
Fri Jun 8 19:24:47 CDT 2018
On Fri, 8 Jun 2018 14:14:51 -0700
"Steven R. Loomis via Unicode" <unicode at unicode.org> wrote:
> > But the consortium has formally dropped the commitment to DUCET in
> > CLDR. Even when restricted to strings of assigned characters, the
> > CLDR and ICU no longer make the effort to support the DUCET
> > collation.
> CLDR is not a collation implementation, it is a data repository with
> associated specification. It was never required to 'support' DUCET.
> The contents of CLDR have no bearing on whether implementations
> support DUCET.
DUCET used to be the root collation of CLDR.
> CLDR ≠ ICU.
DUCET is a standard collation. Language-specific collations are
stored in CLDR, so why not an international standard? Does ICU store
collations not defined in CLDR? The formal snag is that the collations
have to be LDML tailorings of the CLDR root collation, which is a
formal problem for U+FDD0. I would expect you to argue that it is more
useful for U+FDD0 to have the special behaviour defined in CLDR, and
restrict conformance with DUCET to characters other than noncharacters.
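For reference, U+FDD0 is one of the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A minimal Python sketch of the set that such a restricted conformance clause would exclude (the function name is illustrative only):

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point is one of Unicode's 66 noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```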
> On Fri, Jun 8, 2018 at 10:41 AM, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
> > On Fri, 8 Jun 2018 13:40:21 +0200
> > Mark Davis ☕️ <mark at macchiato.com> wrote:
> > > > The UCA contains features essential for respecting canonical
> > > > equivalence. ICU works hard to avoid the extra effort involved,
> > > > apparently even going to the extreme of implicitly declaring
> > > > that Vietnamese is not a human language.
> > > A bit over the top, eh?
> > Then remove the "no known language" from the bug list
> What does this refer to?
Under the heading "Known Limitations" it says:
"The following are known limitations of the ICU collation
implementation. These are theoretical limitations, however, since there
are no known languages for which these limitations are an issue.
However, for completeness they should be fixed in a future version
after 1.8.1. The examples given are designed for simplicity in testing,
and do not match any real languages."
Then, the particular problem is listed under the heading "Contractions
Spanning Normalization". The assumption is that FCD strings do not
need to be decomposed. This comes unstuck when what is locally a
secondary weight due to a diacritic on a vowel has to be promoted to a
primary weight to support syllable by syllable collation in a system
not set up for such a tiered comparison.
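A rough Python illustration of how an FCD string can hide a contraction. This is a sketch under assumptions, not the example from the ICU documentation: the tailoring that contracts e + U+0302 is hypothetical, and `is_fcd` is a simplified check written for this illustration rather than ICU's code.

```python
import unicodedata


def is_fcd(s: str) -> bool:
    """Simplified FCD test: each character's leading combining class (after
    full decomposition) must not be lower than the previous character's
    trailing combining class, unless the leading class is 0."""
    prev_trail = 0
    for ch in s:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        trail = unicodedata.combining(decomposed[-1])
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW, as in Vietnamese.
text = "\u1ec7"
nfd = unicodedata.normalize("NFD", text)

# Hypothetical contraction on e + COMBINING CIRCUMFLEX ACCENT.
contraction = "e\u0302"

fcd_ok = is_fcd(text)                     # passes the FCD check
contiguous_in_nfd = contraction in nfd    # the dot below intervenes
```

The string passes the FCD check, yet its NFD form is e, U+0323, U+0302, so the hypothetical contraction only matches discontiguously and only once the string has actually been decomposed.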
> > …ICU isn't
> > fast enough to load a collation from customisation - it takes
> > hours!
> > ICU is, alas, ridiculously slow
> I'm also curious what this refers to, perhaps it should be a separate
> ICU bug?
There may be reproducibility issues. A proper bug report will take some
work. There's also the argument that nearly 200,000 contractions is
excessive. I had to disable certain checks that were treating "should
not" as a prohibition - working round them either exceeded ICU's
capacity because of the necessary increase in the number of
contractions, or was incompatible with the design of the collation.
The weight customisation creates 45 new weights, with lines like
"&\u0EA1 = \ufdd2\u0e96 < \ufdd2\u0e97 # MO for THO_H & THO_L"
I use strings like \ufdd2\u0e96 to emulate ISO/IEC 14651
(primary) weights. I carefully reuse default Lao weights so as to keep
collating elements' list of collation elements short.
There are a total of 187174 non-comment lines, most being simple contractions like
"&\u0ec8\ufdd2\u0e96\ufdd2AAW\ufdd3\u0e94 = \u0ec8\u0e96\u0ead\u0e94 #
1+K+AW+N <ST1><SK><SAAW><SNF> N is mandatory!"
and prefix contractions like
"&\ufdd2AAW\ufdd3\u0e81\u0ec9 = \u0e96\u0ec9 | ອ\u0e81 # K+1|ອ+N
<SAAW><SNF><ST1> N is mandatory".
I strip the comments off as I convert the collation definition to
UTF-16; if I remember correctly I also have to convert escape sequences
to characters. That processing is a negligible part of the time.
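The preprocessing described above might look roughly like this in Python. It is only a sketch: the actual script is not shown in this message, the function name is hypothetical, and it assumes that '#' never occurs quoted inside a rule and that only \uXXXX escapes are used.

```python
import re

# Matches a four-digit \uXXXX escape as written in the rule file.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def preprocess_rule(line: str) -> str:
    """Strip a trailing '#' comment and expand \\uXXXX escapes to characters."""
    body = line.split("#", 1)[0].rstrip()
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)


rule = r"&\u0EA1 = \ufdd2\u0e96 < \ufdd2\u0e97 # MO for THO_H & THO_L"
processed = preprocess_rule(rule)

# The rule string as UTF-16 code units (little-endian, no BOM).
utf16 = processed.encode("utf-16-le")
```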
By comparison, the loading of 30,000 lines from allkeys.txt is barely [...]
The generation of the loading of the collation was reasonably fast when
I generated DUCET-style collation weights using bash.
For my purposes, I would get better performance if ICU's collation just
blindly converted strings to NFD, but then all I am using it for is to
compare collation rules against a dictionary. I suspect it's just that
I lose out massively as a result of ICU's tradeoffs.