Lao Collation (was: question about identifying CLDR coverage % for Amharic)

Richard Wordingham richard.wordingham at ntlworld.com
Fri Mar 10 17:16:48 CST 2017


On Thu, 2 Mar 2017 20:11:48 +0000
Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

I think on balance this is a CLDR question rather than an ICU
support question - the clincher is that the answer may be not to use
ICU.

> On Thu, 2 Mar 2017 11:50:27 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:

> > ​Also, would it be possible for you to supply the ordering rules for
> > CLDR?  

> I currently define collation element weights; when I was working on
> the rules, I did not trust the CLDR collation definition format.  I
> will make an effort to convert the rules into the proper format.  If
> I can't work out how to express them succinctly and intelligibly,
> I'll try to create a C program to generate the data.  (I currently
> generate the table using a bash script, and I suspect bash source
> code will not be welcome.)

> 3) I think there is 'too much data'; the last table I generated
> defined 184,674 collation elements for Lao. On the other hand, the
> generating script is just 642 lines long, and 25% of lines are
> comment lines.

I've expressed the table as CLDR tailoring rules, but I am having
difficulty in checking them.  To check them, I need an implementation
that:

1) is cheap (ideally free) to use.
2) interprets LDML the same way as ICU.

It would help if I could be assured that two compliant implementations
of CLDR would give the same results for collation when normalisation is
enabled.  I had hoped to use ICU to test the tailoring; building my
own implementation would be very vulnerable to my misinterpretations of
LDML.
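A minimal sketch of the kind of cross-check meant here, in Python: two
stand-in sort-key functions (toy keys, not real LDML collators) are
compared by the total order each induces on a shared word list, so any
disagreement pinpoints a string pair worth inspecting.

```python
import unicodedata

def sort_key_a(s):
    # Stand-in "implementation A": NFD-normalise, then code-point order.
    return [ord(c) for c in unicodedata.normalize("NFD", s)]

def sort_key_b(s):
    # Stand-in "implementation B": route through NFC first; normalisation
    # idempotence means the NFD result, and hence the order, must agree.
    return [ord(c) for c in
            unicodedata.normalize("NFD", unicodedata.normalize("NFC", s))]

def implementations_agree(words):
    """True iff both key functions induce the same sorted order."""
    return sorted(words, key=sort_key_a) == sorted(words, key=sort_key_b)
```

With two genuinely independent collators plugged in for the key
functions, the same harness would exercise the tailoring directly.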

However, I've hit a hard limit with ICU version 58.2 in the storage of
CE32s in the field CollationDataBuilder::ce32s: 0x7FFFF = 524,287
elements. I'm not sure where they're coming from, but there are a great
many entries of length 20 to 26.  (In DUCET form, my entries typically
have 3 or 4 CEs.) The limit is connected to the size of bit fields in
ICU, so it would be hard for me to change it.  I hit the limit in the
26,070th expansion of the array in
CollationDataBuilder::encodeExpansion32, which yields an index of
524,267, but according to the error details, parsing has only reached
line 7268 of my rules input file.  Apart from the creation of abstract
weights (given below) and comments, my rules input file has one
contraction per line.
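A back-of-envelope check of those figures (assuming, from the limit's
value, that the index is a 19-bit bit field, and that each expansion
occupies roughly its length in CE32 slots - both assumptions about the
builder's internals):

```python
# 0x7FFFF is exactly a full 19-bit field, consistent with the limit
# being baked into ICU's bit-field layout rather than easily raised.
LIMIT = 0x7FFFF
assert LIMIT == 2**19 - 1 == 524_287

# ~26,070 expansions of length 20-26 would occupy at least
# 26,070 * 20 = 521,400 slots, already brushing the ceiling -
# which matches the reported final index of 524,267.
expansions, min_len = 26_070, 20
assert expansions * min_len <= LIMIT
print(expansions * min_len)   # 521400
```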

I may be able to work round it by testing subsets of the script.

I set up the abstract weights (corresponding to initial consonants,
compound vowels, final consonants and tones) using

# Vowels
&\u0eb9 < \ufdd2ເ < ເ < \ufdd2ແ < ແ < \u0ebb
< ໂ < \ufdd2AW < \ufdd2AAW < \ufdd2OE < \ufdd2OOE <
\ufdd2IA < \ufdd2IIA < \ufdd2UEA < \ufdd2UUEA <
\ufdd2UA < \ufdd2UUA < ໄ < ໃ < \ufdd2AO < ຳ
# Initial consonants
<\ufdd2\u0ede<\ufdd2\u0e81<\ufdd2\u0e82<\ufdd2\u0e84<\ufdd2\u0e87
<\ufdd2\u0e88<\ufdd2\u0eaa<\ufdd2\u0e8a<\ufdd2\u0edf<\ufdd2\u0e8d
<\ufdd2\u0e94<\ufdd2\u0e95<\ufdd2\u0e96<\ufdd2\u0e97<\ufdd2\u0e99
<\ufdd2\u0e9a<\ufdd2\u0e9b<\ufdd2\u0e9c<\ufdd2\u0e9d<\ufdd2\u0e9e
<\ufdd2\u0e9f<\ufdd2\u0ea1<\ufdd2\u0ea2<\ufdd2\u0ea3<\ufdd2\u0ea5
<\ufdd2\u0ea7<\ufdd2\u0eab<\ufdd2\u0ead
<\ufdd2\u0eae
# Tones
<\u0ec8<\u0ec9<\u0eca<\u0ecb
# Final consonants
<\ufdd3\u0e81<\ufdd3\u0e87<\ufdd3\u0edf<\ufdd3\u0e8d<\ufdd3\u0e94
<\ufdd3\u0e99<\ufdd3\u0e9a<\ufdd3\u0ea1<\ufdd3\u0ea7<\ufdd3\u0ebd
# Treat ຽ as ຍ finally - TBC!
&\ufdd3ຍ=\ufdd3ຽ

I don't need to tailor the weights for the vowels U+0EB0 LAO VOWEL SIGN
A to U+0EB9 LAO VOWEL SIGN UU.
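The abstract-weight idea can be illustrated with a toy rewriter (not
the generating bash script; the syllable shape handled is deliberately
minimal): preposed vowels move after their initial consonant, initials
and vowels get a U+FDD2 prefix, and finals a U+FDD3 prefix, matching
the keys in the rules above.

```python
INIT = "\ufdd2"    # noncharacter prefix for initial consonants and vowels
FINAL = "\ufdd3"   # noncharacter prefix for final consonants
PREPOSED = "\u0ec0\u0ec1\u0ec2\u0ec4\u0ec3"   # ເ ແ ໂ ໄ ໃ, written before C

def abstract_key(syllable):
    """Toy reordering for syllables shaped [preposed vowel] C [final C]."""
    chars = list(syllable)
    vowel = chars.pop(0) if chars and chars[0] in PREPOSED else None
    out = []
    if chars:
        out.append(INIT + chars.pop(0))   # initial consonant first
    if vowel is not None:
        out.append(INIT + vowel)          # vowel keys follow the initial
    out.extend(FINAL + c for c in chars)  # anything left treated as final
    return "".join(out)

# ເກ (vowel E written before KO KAI) -> initial key, then vowel key
assert abstract_key("\u0ec0\u0e81") == "\ufdd2\u0e81\ufdd2\u0ec0"
# ກບ -> initial KO KAI, final BO
assert abstract_key("\u0e81\u0e9a") == "\ufdd2\u0e81\ufdd3\u0e9a"
```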

I then use contractions like

&\ufdd2\u0eab\ufdd2\u0e8d\ufdd2UUA\ufdd3\u0e9a = \u0eab\u0e8dວ\u0e9a

It is possible to eliminate some contractions - this one captures the
observation, redundant in pure Lao, that in a word starting thus,
U+0E9A LAO LETTER BO would belong to the first syllable, not the
second.  The vowel would be spelt differently in an open syllable.
This matters, because, for example, ການ້ຳ 'kettle' comes before its
string-theoretical prefix ການ 'work, action', though with the vowel in
this pair, characters beyond the 'ນ' have to be taken into account.
Actually, I think I had already eliminated this contraction, but then
mistranslated from bash to C :-(
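To make the 'kettle' pair concrete, a hypothetical demonstration with
hand-segmented syllable keys (real segmentation needs the full
contraction set, and tones are ignored here): plain code-point order
puts the string prefix first, while a syllable-aware key reverses the
pair.

```python
KETTLE = "\u0e81\u0eb2\u0e99\u0ec9\u0eb3"   # ການ້ຳ 'kettle': ກາ + ນ້ຳ
WORK   = "\u0e81\u0eb2\u0e99"               # ການ 'work, action': ການ

# One (initial, vowel, final) triple per syllable; the empty final of
# open ກາ sorts before the final ນ of ການ.
syllable_keys = {
    KETTLE: [("\u0e81", "\u0eb2", ""), ("\u0e99", "\u0eb3", "")],
    WORK:   [("\u0e81", "\u0eb2", "\u0e99")],
}

assert sorted([KETTLE, WORK]) == [WORK, KETTLE]             # prefix first
assert sorted([KETTLE, WORK], key=syllable_keys.get) == [KETTLE, WORK]
```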

So, is there a more capacious alternative to ICU?  Speed matters less
for testing, and even ICU currently takes over 2 hours to choke on my
tailoring, though I think I can see a time-space tradeoff in
CollationDataBuilder::encodeExpansion32 that may speed matters up.
Alternatively, am I accidentally creating overly long contractions? I
hope to eliminate abstract weights, but first I want to check the move
from DUCET-style weight tables to LDML notation.

Richard.



More information about the CLDR-Users mailing list