UCA question / Produce Collation Element Arrays

Sun Dec 3 13:49:03 CST 2017

Mark <https://twitter.com/mark_e_davis>

On Sun, Dec 3, 2017 at 8:23 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sun, 3 Dec 2017 13:36:57 +0100
> Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > The algorithm is predicated on any input table being well formed. (
> > http://unicode.org/reports/tr10/#Well-Formed)
> >
> > Tibetan is a documented exception in the DUCET, but it also documents
> > how to fix it.
>
> But adding the fix does not preserve the order of all strings in
> the Tibetan script, only the order of linguistically plausible strings.
> The example is the order of the non-defective NFD strings
>
>  ཀྲ྄ཱ 0F40 0FB2 0F84 0F71
>  ཀྲ྄ 0F40 0FB2 0F84
>  ཀྲཱ 0F40 0FB2 0F71
>
> (I've only added U+0F40 to make the strings non-defective.)
>
> Relevant facts are:
>
> ccc(0F84) = 9
> ccc(0F71) = 129
> CE(0F71) < CE(0F84)
> All relevant collation elements have different primary weights.
>
> Under DUCET, we get:
> Key of 0F40 0FB2 0F71      = CE(0F40) CE(0FB2) CE(0F71)
> Key of 0F40 0FB2 0F84      = CE(0F40) CE(0FB2) CE(0F84)
> Key of 0F40 0FB2 0F84 0F71 = CE(0F40) CE(0FB2) CE(0F84) CE(0F71)
>
> Tailoring DUCET by adding 'all ten' contractions, making a well-formed
> collation while not perturbing the sorting of Sanskrit, yields a
> different order:
>
> Key of 0F40 0FB2 0F71      = CE(0F40) CE(0FB2) CE(0F71)
> Key of 0F40 0FB2 0F84 0F71 = CE(0F40) CE(0FB2) CE(0F71) CE(0F84)
> Key of 0F40 0FB2 0F84      = CE(0F40) CE(0FB2) CE(0F84)
>
> To create a well-formed collation equivalent to DUCET, one has to add
> many more contractions - about 650 by my reckoning.
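
The reordering described above can be reproduced with a toy model. What
follows is only a sketch under stated assumptions, not the UCA algorithm
and not real DUCET data: the primary weights are made up (chosen only so
that CE(0F71) < CE(0F84)), a single contraction <0FB2, 0F71> stands in for
the 'all ten', and only two-character contractions whose second character
is a non-starter are handled. The names PRIMARY, CONTRACTIONS and sort_key
are hypothetical. The discontiguous match skips U+0F84 because
ccc(0F84) = 9 is lower than ccc(0F71) = 129, so U+0F71 is not blocked.

```python
import unicodedata

# Hypothetical primary weights (NOT real DUCET values); the only property
# relied on is CE(0F71) < CE(0F84), as stated in the message.
PRIMARY = {"\u0F40": 10, "\u0FB2": 20, "\u0F71": 30, "\u0F84": 40}

# One stand-in contraction, mapped to the same weights the two characters
# would get separately, mirroring the keys written out in the message.
CONTRACTIONS = {("\u0FB2", "\u0F71"): (20, 30)}

A = "\u0F40\u0FB2\u0F71"
B = "\u0F40\u0FB2\u0F84"
C = "\u0F40\u0FB2\u0F84\u0F71"


def sort_key(s, contractions=None):
    """Toy primary-level key with optional discontiguous contractions."""
    chars = list(unicodedata.normalize("NFD", s))
    key = []
    i = 0
    while i < len(chars):
        start = chars[i]
        matched = False
        if contractions:
            max_between = 0  # highest ccc among characters skipped so far
            for j in range(i + 1, len(chars)):
                ccc = unicodedata.combining(chars[j])
                if ccc == 0:
                    break  # a starter blocks everything after it
                # Unblocked only if every skipped character has a lower ccc.
                if ccc > max_between and (start, chars[j]) in contractions:
                    key.extend(contractions[(start, chars[j])])
                    del chars[j]  # consumed by the contraction
                    matched = True
                    break
                max_between = max(max_between, ccc)
        if not matched:
            key.append(PRIMARY[start])
        i += 1
    return tuple(key)


def tailored_key(s):
    return sort_key(s, CONTRACTIONS)


plain_order = sorted([C, B, A], key=sort_key)
tailored_order = sorted([C, B, A], key=tailored_key)
```

With these assumed weights, plain_order comes out as A, B, C and
tailored_order as A, C, B, which is the change of order the two lists of
keys above describe.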

> So, are you saying that a UCA-conformant implementation can simply
> reject DUCET for not being well-formed?

Well, yes, if they don't use 
http://unicode.org/reports/tr10/#Well_Formed_DUCET to fix it in one way or
another. CLDR does do adjustments, for example.

> Alternatively, are you
> claiming that there is a known, straightforward algorithm to repair
> any case of non-compliance with WF5 without changing the ordering of
> strings?
>

The algorithm is not defined for non-well-formed tables, so it is odd to
talk about "without changing the ordering of strings". I think your main
point (above) is that you think that a batch of other changes are necessary
for it to work for Tibetan. That may be the case; I am not that familiar
with Tibetan requirements.

>
> Richard.
>