UCA question / Produce Collation Element Arrays

Sun Dec 3 13:23:51 CST 2017

On Sun, 3 Dec 2017 13:36:57 +0100
Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:

> The algorithm is predicated on any input table being well formed. (
> http://unicode.org/reports/tr10/#Well-Formed)
> 
> Tibetan is a documented exception in the DUCET, but it also documents
> how to fix it.

But adding the fix does not preserve the order of all strings in
the Tibetan script, only the order of linguistically plausible strings.
The example is the order of the following non-defective NFD strings:

 ཀྲ྄ཱ 0F40 0FB2 0F84 0F71
 ཀྲ྄ 0F40 0FB2 0F84
 ཀྲཱ 0F40 0FB2 0F71

(I've only added U+0F40 to make the strings non-defective.)

Relevant facts are:

ccc(0F84) = 9
ccc(0F71) = 129
CE(0F71) < CE(0F84)
All relevant collation elements have different primary weights.

Under DUCET, we get:
Key of 0F40 0FB2 0F71      = CE(0F40) CE(0FB2) CE(0F71)
Key of 0F40 0FB2 0F84      = CE(0F40) CE(0FB2) CE(0F84)
Key of 0F40 0FB2 0F84 0F71 = CE(0F40) CE(0FB2) CE(0F84) CE(0F71)

Tailoring DUCET by adding 'all ten' contractions, which makes the
collation well-formed without perturbing the sorting of Sanskrit,
yields a different order:

Key of 0F40 0FB2 0F71      = CE(0F40) CE(0FB2) CE(0F71)
Key of 0F40 0FB2 0F84 0F71 = CE(0F40) CE(0FB2) CE(0F71) CE(0F84)
Key of 0F40 0FB2 0F84      = CE(0F40) CE(0FB2) CE(0F84)
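
The difference between the two orderings can be reproduced with a small
Python sketch of the discontiguous matching in UCA step S2.1. It is only
an illustration under stated assumptions: the primary weights are
made-up placeholders that respect CE(0F71) < CE(0F84), the toy tables
hold just the four code points above, and of the added contractions only
<0FB2, 0F71> is modelled, assumed to map to CE(0FB2) CE(0F71) as in the
key above. Secondary and tertiary weights are ignored.

```python
# Combining classes as quoted in the post; 0F40 and 0FB2 are starters (ccc = 0).
CCC = {0x0F40: 0, 0x0FB2: 0, 0x0F71: 129, 0x0F84: 9}

# ASSUMPTION: placeholder primary weights, not real DUCET values.
# Only the relative order CE(0F71) < CE(0F84) is taken from the post.
PRIMARY = {0x0F40: 10, 0x0FB2: 20, 0x0F71: 30, 0x0F84: 40}


def ce(*cps):
    """Primary weights for a sequence of code points."""
    return tuple(PRIMARY[cp] for cp in cps)


# Toy stand-in for DUCET: single code points only, no <0FB2, 0F71> entry.
DUCET_LIKE = {(cp,): ce(cp) for cp in PRIMARY}

# Toy tailoring: add the prefix contraction <0FB2, 0F71>.
TAILORED = dict(DUCET_LIKE)
TAILORED[(0x0FB2, 0x0F71)] = ce(0x0FB2, 0x0F71)


def sort_key(code_points, table):
    """Primary-only key using longest match plus discontiguous contractions."""
    cps = list(code_points)
    key = []
    while cps:
        # S2.1: longest contiguous prefix present in the table.
        # (Every code point used here has an entry, so a match always exists.)
        n = max(k for k in range(1, len(cps) + 1) if tuple(cps[:k]) in table)
        match, cps = tuple(cps[:n]), cps[n:]
        # S2.1.1 to S2.1.3: try to absorb following unblocked non-starters.
        j, max_skipped = 0, 0
        while j < len(cps) and CCC[cps[j]] != 0:
            c = cps[j]
            # c is unblocked if every skipped mark has a strictly lower ccc.
            if CCC[c] > max_skipped and match + (c,) in table:
                match += (c,)
                del cps[j]
            else:
                max_skipped = max(max_skipped, CCC[c])
                j += 1
        key.extend(table[match])
    return tuple(key)


STRINGS = [
    (0x0F40, 0x0FB2, 0x0F84, 0x0F71),
    (0x0F40, 0x0FB2, 0x0F84),
    (0x0F40, 0x0FB2, 0x0F71),
]


def order(table):
    """The three test strings, sorted by their key under the given table."""
    ranked = sorted(STRINGS, key=lambda s: sort_key(s, table))
    return [" ".join(f"{cp:04X}" for cp in s) for s in ranked]


print(order(DUCET_LIKE))
print(order(TAILORED))
```

With the toy table lacking the contraction, 0F40 0FB2 0F84 0F71 sorts
last; once <0FB2, 0F71> is present, U+0F71 is pulled forward past U+0F84
and the string moves between the other two.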

To create a well-formed collation equivalent to DUCET, one has to add
many more contractions - about 650 by my reckoning.

So, are you saying that a UCA-conformant implementation can simply
reject DUCET for not being well-formed?  Alternatively, are you
claiming that there is a known, straightforward algorithm to repair
any case of non-compliance with WF5 without changing the ordering of
strings?
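
For what it is worth, detecting a WF5 violation is mechanical; it is the
order-preserving repair that is in question. A minimal Python sketch of
the check follows, assuming WF5 as worded in UTS #10 (a contraction of
N > 2 code points ending in a non-starter requires the contraction of
its first N-1 code points). The function name and the two-entry SAMPLE
table are hypothetical; SAMPLE is meant to resemble the Tibetan
contraction entries, not to reproduce the DUCET.

```python
import unicodedata


def missing_wf5_prefixes(contractions):
    """Return the (N-1)-prefixes that WF5 requires but the table lacks."""
    table = set(contractions)
    missing = set()
    for seq in table:
        ends_in_non_starter = unicodedata.combining(chr(seq[-1])) != 0
        if len(seq) > 2 and ends_in_non_starter and seq[:-1] not in table:
            missing.add(seq[:-1])
    return sorted(missing)


# ASSUMPTION: illustrative subset of contractions, not the real table.
SAMPLE = {
    (0x0FB2, 0x0F80),
    (0x0FB2, 0x0F71, 0x0F80),
}

for prefix in missing_wf5_prefixes(SAMPLE):
    print(" ".join(f"{cp:04X}" for cp in prefix))
```

The sketch reports <0FB2, 0F71> as missing for this sample. It says
nothing about whether adding the reported prefixes leaves the order of
all strings unchanged.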

Richard.