UCA question / Produce Collation Element Arrays

Sun Dec 3 16:48:57 CST 2017

On Sun, 3 Dec 2017 20:49:03 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> Mark <https://twitter.com/mark_e_davis>
> On Sun, Dec 3, 2017 at 8:23 PM, Richard Wordingham via CLDR-Users <
> cldr-users at unicode.org> wrote:
> > On Sun, 3 Dec 2017 13:36:57 +0100
> > Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:
> > So, are you saying that a UCA-conformant implementation can simply
> > reject DUCET for not being well-formed?  

> Well, yes, if they don't use 
> http://unicode.org/reports/tr10/#Well_Formed_DUCET to fix it in one
> way or another. CLDR does do adjustments, for example.

Interesting.  So an implementation can reject the conformance test as
invalid.  It would seem that an implementation that simply prints "DUCET
is not well-formed!" passes the conformance test provided.

What do you mean by 'CLDR does...'?  I have seen ICU wrongly reject
apparently redundant collating elements of a collation - but perhaps I
was doing something wrong.  Do you just mean that the CLDR root
collation includes the ten additions?

> > Alternatively, are you
> > claiming that there is a known, straightforward algorithm to repair
> > any case of non-compliance with WF5 without changing the ordering of
> > strings?

> The algorithm is not defined for non-well-formed strings, so it is
> odd to talk about "without changing the ordering of strings".

I think you've misunderstood my assertion.  By the "ordering of
strings" I mean the order in which they are sorted, not the ordering of
the bytes within the strings.  I was not talking about strings that
are not well-formed.

> I think
> your main point (above) is that you think that a batch of other
> changes are necessary for it to work for Tibetan. That may be the
> case; I am not that familiar with Tibetan requirements.

No, my new point was that making DUCET comply with WF5 without
altering the ordering requires about 650 additional contractions.
However, only the 10 (really 6) contractions are needed for natural
language strings.  The 650, for example, include four contractions for
each virama, though in natural language there is only one virama that
occurs with Tibetan consonants.
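
To illustrate the prefix condition in WF5 (as I read it in UTS #10: a
contraction of more than two code points whose last code point is a
non-starter requires the contraction formed by dropping that last code
point), here is a minimal Python sketch.  The function name
wf5_missing_prefixes and the two-entry table are hypothetical, not
DUCET data, and the sketch only checks the literal prefix condition; it
says nothing about whether the ordering of strings is preserved.

```python
import unicodedata


def wf5_missing_prefixes(contractions):
    """Return the prefix contractions that the WF5 prefix condition would require.

    `contractions` is an iterable of strings, one per contraction.
    Assumption: a code point is a non-starter when its canonical
    combining class is non-zero.
    """
    table = set(contractions)
    missing = set()
    work = list(table)
    while work:
        seq = work.pop()
        # Only contractions longer than two code points that end in a
        # non-starter trigger the condition.
        if len(seq) > 2 and unicodedata.combining(seq[-1]) != 0:
            prefix = seq[:-1]
            if prefix not in table:
                table.add(prefix)
                missing.add(prefix)
                # A newly required prefix may itself require a shorter one.
                work.append(prefix)
    return missing


# Toy table (illustrative only, not taken from allkeys.txt).
toy_table = {"\u0FB2\u0F71\u0F80", "\u0FB2\u0F80"}
missing = wf5_missing_prefixes(toy_table)
print(sorted(" ".join(f"{ord(c):04X}" for c in s) for s in missing))
```

For this toy table the only prefix reported is 0FB2 0F71, because
0F80 is a non-starter and 0FB2 0F71 is absent.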

The UCA conformance test includes many strings that do not occur in
natural language, as in the example given in
https://www.unicode.org/Public/UCA/10.0.0/CollationTest.html , namely
0FB2 0F80 0F71 0334, which does not sort equal to 0F77 0334 under DUCET,
but does when just the ten contractions are added.  This pair no longer
appears in the conformance test.
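
The effect on that example can be sketched in Python.  The following
is a simplified rendering of the contraction matching in UCA step S2.1
(longest contiguous match, then extension by unblocked non-starters),
run on a toy table.  The function name lookup_units and the tables are
hypothetical; the sketch returns the strings that would be looked up,
not collation elements, and assumes every single code point has a
mapping.

```python
import unicodedata


def lookup_units(text, contractions):
    """Split NFD(text) into the strings that would be looked up in the table."""
    chars = list(unicodedata.normalize("NFD", text))
    units = []
    while chars:
        # Longest contiguous initial match; a single code point is the fallback.
        n = 1
        for k in range(len(chars), 1, -1):
            if "".join(chars[:k]) in contractions:
                n = k
                break
        s = "".join(chars[:n])
        del chars[:n]
        # Try to extend S with following non-starters that are not blocked.
        i = 0
        max_skipped_ccc = 0  # highest combining class among skipped marks
        while i < len(chars):
            ccc = unicodedata.combining(chars[i])
            if ccc == 0:
                break  # a starter ends the run of candidates
            if ccc > max_skipped_ccc and s + chars[i] in contractions:
                s += chars[i]
                del chars[i]
            else:
                max_skipped_ccc = max(max_skipped_ccc, ccc)
                i += 1
        units.append(s)
    return units


def hexes(units):
    return [" ".join(f"{ord(c):04X}" for c in u) for u in units]


example = "\u0FB2\u0F80\u0F71\u0334"
without_prefix = {"\u0FB2\u0F71\u0F80", "\u0FB2\u0F80"}
with_prefix = without_prefix | {"\u0FB2\u0F71"}

print(hexes(lookup_units(example, without_prefix)))
# ['0FB2 0F80', '0334', '0F71']
print(hexes(lookup_units(example, with_prefix)))
# ['0FB2 0F71 0F80', '0334']
```

Normalization reorders the example to 0FB2 0334 0F71 0F80.  Without
the prefix 0FB2 0F71, the match can only be extended by 0F80, leaving
0F71 behind; with it, the full contraction 0FB2 0F71 0F80 is reached,
which is the compatibility decomposition of 0F77.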

Richard.