question about identifying CLDR coverage % for Amharic

Richard Wordingham richard.wordingham at ntlworld.com
Thu Mar 2 14:11:48 CST 2017


On Thu, 2 Mar 2017 11:50:27 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> Also, would it be possible for you to supply the ordering rules for
> CLDR?

I currently define collation element weights; when I was working on the
rules, I did not trust the CLDR collation definition format.  I will
make an effort to convert the rules into the proper format.  If I can't
work out how to express them succinctly and intelligibly, I'll try to
create a C program to generate the data.  (I currently generate the
table using a bash script, and I suspect bash source code will not be
welcome.)

> Longer term, if the rules can be expressed without too much data, I
> think the change should be made in the DUCET; no need for that to
> differ gratuitously from what is acceptable in Laos.

1) I'm interested to know how you would square the change with the
stability policy: (http://www.unicode.org/collation/ducet-changes.html)

"Changes for characters which have been in the standard for longer than
2 years should generally be disallowed. The UTC can overrule this and
mandate a change in a character weight entry, but should only do so
when it determines that there is an egregious error or finds some other
very strong motivation for disturbing an established value. In less
than such extreme circumstances, solutions involving tailoring should
be preferred."

2) DUCET defines a finite collation element table; Lao needs an
infinite collation element table for use in the Unicode Collation
Algorithm.  The CLDR syntaxes allow finite expression of infinite
tables by means of chaining context-sensitive mappings.

3) I think there is 'too much data'; the last table I generated defined
184,674 collation elements for Lao.  On the other hand, the generating
script is just 642 lines long, and 25% of lines are comment lines.
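
To illustrate how a short generator can emit a very large table, here is a
minimal sketch in Python.  It is not the bash script mentioned above and does
not reproduce its rules or its counts; it only assumes that contraction keys
are built by combining the Lao prefixed vowels (U+0EC0..U+0EC4) with
consonants, and optionally with a final consonant.  The names
`contraction_keys` and `with_final` are hypothetical.

```python
import itertools
import unicodedata
from typing import List

# The five Lao vowel signs that are written before their consonant.
PREFIXED_VOWELS = [chr(cp) for cp in range(0x0EC0, 0x0EC5)]

# Lao consonants as known to the running Python's Unicode database;
# unassigned code points in the range are skipped.
CONSONANTS = [
    chr(cp) for cp in range(0x0E81, 0x0EAF)
    if unicodedata.category(chr(cp)) == "Lo"
]

def contraction_keys(with_final: bool) -> List[str]:
    """Enumerate vowel+consonant keys, optionally with a final consonant."""
    parts = [PREFIXED_VOWELS, CONSONANTS]
    if with_final:
        parts.append(CONSONANTS)
    return ["".join(combo) for combo in itertools.product(*parts)]

pairs = contraction_keys(with_final=False)
triples = contraction_keys(with_final=True)
print(len(pairs), len(triples))
```

Each additional position in the key multiplies the number of entries by the
size of the set it ranges over, which is why a few hundred lines of generator
can plausibly produce a table with entries in the hundreds of thousands.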

Lao text needs a preprocessing stage in which syllable boundary marks
are inserted.  That would shorten the table considerably, and allow the
UCA to use a finite collation element table.
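
The shape of such a preprocessing stage could be sketched as follows, in
Python.  This is only an illustration under stated assumptions: the boundary
mark (`BOUNDARY`) is a placeholder, since no particular character is named
here, and `toy_segment` is a deliberately naive stand-in that splits before
each prefixed vowel.  Real Lao syllabification is considerably more involved.

```python
from typing import Callable, List

# Placeholder boundary mark (assumption); whatever character is chosen
# would need its own weight in the collation element table.
BOUNDARY = "\u200b"

def toy_segment(text: str) -> List[str]:
    """Naive stand-in: start a new syllable at each Lao prefixed vowel
    (U+0EC0..U+0EC4).  Not a real Lao syllabifier."""
    syllables: List[str] = []
    current = ""
    for ch in text:
        if "\u0ec0" <= ch <= "\u0ec4" and current:
            syllables.append(current)
            current = ""
        current += ch
    if current:
        syllables.append(current)
    return syllables

def mark_syllables(text: str, segment: Callable[[str], List[str]]) -> str:
    """Insert the boundary mark between the syllables found by `segment`."""
    return BOUNDARY.join(segment(text))

# Two syllables, each starting with a prefixed vowel.
sample = "\u0ec0\u0eae\u0ebb\u0eb2\u0ec4\u0e9b"
marked = mark_syllables(sample, toy_segment)
```

With the boundaries explicit in the text, the table only has to cover the
contents of a single syllable rather than every possible sequence that spans
syllables, so it can be finite.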

Richard.


