Lao Collation (was: question about identifying CLDR coverage % for Amharic)

Sat Mar 18 16:19:15 CDT 2017

On Thu, 2 Mar 2017 11:50:27 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> Also, would it be possible for you to supply the ordering rules for
> CLDR?

I'm still working on converting the rules from a table of weights to
CLDR rules.  Unfortunately, to get a tolerable number of rules, I seem
to need contexts that start with a tone mark.  It seems wrong to ask
users to mark syllable boundaries when they are blindingly obvious.

The problem sequences are syllables like ກ້ານ 'stem' <U+0E81 LAO LETTER
KO, U+0EC9 LAO TONE MAI THO, U+0EB2 LAO VOWEL SIGN AA, U+0E99 LAO LETTER
NO>.  In terms of abstract weights, I need to convert it to
<SK><SAA><FN><T2> where these are the weights for the initial
consonant, vowel, final consonant, and tone, in that order.
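
A minimal Python sketch of that reordering, for illustration only.  It
assumes a single, already isolated syllable made of an initial
consonant, an optional tone mark, a vowel sign and an optional final
consonant; the function name is hypothetical and the output is a list
of characters, not real collation weights.

```python
# The four Lao tone marks, U+0EC8 MAI EK .. U+0ECB MAI CATAWA.
TONE_MARKS = {"\u0ec8", "\u0ec9", "\u0eca", "\u0ecb"}

def abstract_order(syllable):
    """Return the characters of one simple syllable in weight order:
    initial consonant, vowel, final consonant (if any), tone (if any)."""
    initial = syllable[0]
    rest = syllable[1:]
    tone = [c for c in rest if c in TONE_MARKS]
    others = [c for c in rest if c not in TONE_MARKS]
    # The tone mark is deferred to the end of the syllable.
    return [initial] + others + tone

stem = "\u0e81\u0ec9\u0eb2\u0e99"  # KO, MAI THO, AA, NO
print(abstract_order(stem))        # KO, AA, NO, MAI THO
```

The hard part, deciding where the syllable ends, is deliberately left
out of this sketch.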

The big problem in ordering it is that one needs to determine
whether the final NO is syllable-final, or the start of the next
syllable.  If it were the start of the next syllable, the word would
come before ກ້າຫານ 'brave' <U+0E81, U+0EC9, U+0EB2, U+0EAB LAO LETTER HO
SUNG, U+0EB2, U+0E99> because NO comes before HO SUNG in alphabetical order.
However, as Lao orders syllable by syllable, we have the order ກ້າ
'courageous' < ກ້າຫານ < ກ້ານ.
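
To make the contrast concrete, here is a small Python sketch comparing
a plain character-by-character sort with a syllable-by-syllable sort
for ກ້າ, ກ້າຫານ and ກ້ານ.  The syllable segmentation is written out by
hand, and code point order stands in for Lao alphabetical order; both
are assumptions made for the illustration, not part of any real
collator.

```python
# Each syllable is (initial consonant, vowel, final consonant, tone mark).
# An empty final sorts before any final, so open syllables come first.
KA = ("\u0e81", "\u0eb2", "", "\u0ec9")         # KO + AA, tone MAI THO
HAN = ("\u0eab", "\u0eb2", "\u0e99", "")        # HO SUNG + AA + final NO
KAN = ("\u0e81", "\u0eb2", "\u0e99", "\u0ec9")  # KO + AA + final NO, MAI THO

# Hand-made segmentation of the three example words.
WORDS = {
    "\u0e81\u0ec9\u0eb2": (KA,),                       # 'courageous'
    "\u0e81\u0ec9\u0eb2\u0eab\u0eb2\u0e99": (KA, HAN),  # 'brave'
    "\u0e81\u0ec9\u0eb2\u0e99": (KAN,),                # 'stem'
}

def syllable_key(word):
    """Sort key: the tuple of syllables, compared one syllable at a time."""
    return WORDS[word]

by_code_point = sorted(WORDS)                   # NO is treated as if it started a syllable
by_syllable = sorted(WORDS, key=syllable_key)   # NO is recognised as a final
print(by_code_point)
print(by_syllable)
```

In the first list 'stem' precedes 'brave'; in the second it follows
it, which is the order wanted.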

I therefore end up being driven to a series of contractions:

ກ້ > <SK> # Must defer weight for tone mark until final consonant is
known.

້  | າຫາ > <SAA><T2><SHH><SAA> # Identify consonant as starting a
syllable - redundant for ຫ.  Used in ກ້າຫານ.

  ້ | ານ > <SAA><FN><T2> # Would be bled by other contractions, similar
  to the above, if NO started a syllable.  Used in ກ້ານ.

 ້ | າ > <SAA><T2> # Used at the end of an intelligible run of word
 characters.  Used by ກ້າ at the end of a phrase, or if words are being
 separated by ZWSP at input.

For use with ICU, there is an apparently fatal objection, which I'm
currently having difficulty disabling by hacking ICU.  The context does
not start at an NFC boundary, so ICU rejects the rule as invalid.  For
these examples, I therefore have to add a consonant to the start of the
context, which increases the number of contextual contractions
thirty-fold.
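
The boundary problem can be seen with Python's standard unicodedata
module.  This is only a rough sketch: it checks the canonical combining
class of the first character, which is a necessary condition for
starting at a normalization boundary, not the full test ICU applies.
The function name is hypothetical.

```python
import unicodedata

def may_start_at_nfc_boundary(text):
    """Rough check: a string whose first character has a non-zero
    canonical combining class cannot start at a normalization boundary."""
    return unicodedata.combining(text[0]) == 0

tone_first = "\u0ec9\u0eb2"             # context starting with MAI THO
consonant_first = "\u0e81\u0ec9\u0eb2"  # same context with KO prepended

print(may_start_at_nfc_boundary(tone_first))       # tone mark first
print(may_start_at_nfc_boundary(consonant_first))  # consonant first
```

Prepending a consonant passes the check, but at the price of one rule
per possible consonant.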

Now, I may be able to get the number of contractions down by tailoring
more tightly to Lao phonology (with a weather-eye to other Laotian
languages).  Ultimately, it may be possible to secure drastic
reductions by tailoring to the lexicon, with the certainty of some new
words being missorted.

> Longer term, if the rules can be expressed without too much data, I
> think the change should be made in the DUCET; no need for that to
> differ gratuitously from what is acceptable in Laos.

Short of adding Lao tone processing to Quebecois accent processing, I
think Lao tones will have to continue to make secondary differences in
DUCET. The principle of ordering open syllables before closed syllables
with the same initial and vowel may also need too much data.

The 38 modern Lao vowel symbols, simple, compound and including what I
reckon as matres lectionis, should sort as follows: 

ກະ <a ກັ < ກາ < ກິ < ກີ < ກຶ < ກື < ກຸ < ກູ < ເກະ <a ເກັ < ເກ < ແກະ <a
ແກັ < ແກ < ໂກະ <a ກົ < ໂກ < ເກາະ <a ກັອ < ກໍ < ກອ < ເກິ < ເກີ <
ເກັຍ <a ກັຽ < ເກຍ <a ກຽ < ເກຶອ < ເກືອ < ກົວະ <a ກັວ < ກົວ <a ກວ < ໄກ <b
ໃກ < ເກົາ < ກຳ

Latin letters indicate notes, as follows:

a: These could compare primary equal, so long as there was some
mechanism to ensure that open syllables came before closed syllables.

b: A quote of a 1970 decree gives the opposite order!

The current DUCET order is
ກໍ <c,d ກວ <d ກອ <d ກະ < ກັ < ກັວ <d ກັອ < ກັຽ <d ກາ < ກຳ <d ກິ < ກີ <
ກຶ < ກື < ກຸ < ກູ < ກົ < ກົວ <d ກົວະ <d ກຽ <d ເກ < ເກງ* < ເກຍ <d ເກມ*
<d ເກະ < ເກັ < ເກັຍ <d ເກາະ < ເກິ < ເກີ < ເກຶອ < ເກືອ < ເກົາ <d ແກ <d
ແກະ < ແກັ < ໂກ <d ໂກະ < ໃກ <d ໄກ

Notes:

c: The vowel before has only a secondary difference from absence.
d: Wrong by Lao standards.
*: Final consonants have been added to show the interleaving of the
vowels ເກ and ເກຍ.

A few custom rules can clean this up a lot.  An example follows.

The basic idea is to handle the interleaving of vowel symbols starting
with ເ U+0EC0 LAO VOWEL SIGN E by mapping the other symbols to
expansions of ເ.

# Create what would be orphan elements in allkeys.txt.

&  ັ < w < x < y
&າ < a < b < c # 'c' reserved for more ambitious repair.
&  ີ< e < f < g < h # 'g' reserved for more ambitious repair.
&  ື<i < j < k < m < n < p # 'm' reserved for more ambitious repair.

&ກເw = ແກ # A whole family, replacing the one currently used to
'rearrange' ແ U+0EC1 LAO VOWEL SIGN EI.

&ເx = ົ

&ກເy = ໂກ # Another whole family, replacing the one for U+0EC2 LAO VOWEL
SIGN O.

&ເa = ັອ << \u0eb1\u0ec8ອ << \u0eb1\u0ec9ອ << \u0eb1\u0ecaອ <<
\u0eb1\u0ecbອ # Trapped tone mark must be handled.

&ເb = ໍ

&e =  ັຍ << \u0eb1\u0ec8ຍ << \u0eb1\u0ec9ຍ << \u0eb1\u0ecaຍ <<
\u0eb1\u0ecbຍ # Trapped tone mark must be handled.

&ເf=  ັຽ << \u0eb1\u0ec8ຽ << \u0eb1\u0ec9ຽ << \u0eb1\u0ecaຽ <<
\u0eb1\u0ecbຽ # Trapped tone mark must be handled.

&ເh=ຽ

&ເi =  ົວະ  << \u0ebb\u0ec8ວະ << \u0ebb\u0ec9ວະ << \u0ebb\u0ecaວະ <<
\u0ebb\u0ecbວະ # Trapped tone mark should be handled - but the
combinations might not exist in Lao.

&ເj =  ັວ # And the trapped tone marks must also be handled! 

&ເk =  ົວ   # And the trapped tone marks must also be handled! 

&ກເn =  ໄກ   

&ກເp =   ໃກ  

&ເ   <   ຳ # Ideally another 5 rules are needed for those who mistype it as
<U+0ECD LAO NIGGAHITA, [tone mark,] U+0EB2 LAO VOWEL SIGN AA>.
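
Since the 'trapped tone mark' variants all follow one pattern, they
could be generated rather than typed.  A minimal Python sketch,
assuming the rule shape used above (reset, '=', the bare vowel symbol,
then one '<<' variant per tone mark); the function name is
hypothetical, and the output uses literal characters where the rules
above use \u escapes.

```python
# The four Lao tone marks, U+0EC8 .. U+0ECB.
TONE_MARKS = ["\u0ec8", "\u0ec9", "\u0eca", "\u0ecb"]

def trapped_tone_rule(reset, before, after):
    """Build one rule line: the vowel symbol before+after, followed by a
    secondary variant for each tone mark trapped between its two parts."""
    variants = [before + tone + after for tone in TONE_MARKS]
    return "&" + reset + " = " + before + after + " << " + " << ".join(variants)

# The rule for MAI KAN + LETTER O, reset to VOWEL SIGN E followed by 'a'.
rule = trapped_tone_rule("\u0ec0a", "\u0eb1", "\u0ead")
print(rule)
```

Generating the lines this way also avoids copy-and-paste slips in the
last variant of each rule.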

This improves the ordering to:

 ກວ <e ກອ <e ກະ < ກັ < ກາ < ກິ < ກີ < ກຶ < ກື < ກຸ < ກູ < ເກ < ເກງ* <
 ເກຍ <g ເກມ* <f ເກະ < ເກັ < ແກ <f ແກະ < ແກັ < ກົ < ໂກ <f ໂກະ < ເກາະ <
 ກັອ < ກໍ < ເກິ < ເກີ < ເກັຍ < ກັຽ < ກຽ < ເກຶອ < ເກືອ < ກົວະ < ກັວ <
 ກົວ < ໄກ < ໃກ < ເກົາ < ກຳ

Excuses:

e: Putting ກວ and ກອ in the right places requires syllable boundary
analysis.  CLDR collation can do it, with manual marking of boundaries
in truly ambiguous cases.  (The marking would be by CGJ or ZWSP.)  The
number of contractions required may be large by the usual standards.
It's possible that this might be doable by UCA collation - making tone
a secondary difference greatly simplifies matters.

f: Putting ເກະ, ເກັ, ແກະ, ແກັ and ໂກະ in the right places would double
the number of rearrangement contractions for Lao.  Additionally,
trapped tone marks might need to be supported for ເກະ, ແກະ, and ໂກະ.
Possibly they could be added as discovered, but that does not accord
with the Unicode policy of DUCET stability.

g: Putting ເກຍ in the right place will require the equivalent of 5
vowels' worth of contractions, and will have the additional
complications mentioned in Note (e) above. 

There are also problems with the handling of consonant clusters. DUCET
(mostly) follows through on the compatibility decompositions of
U+0EDC LAO HO NO and U+0EDD LAO HO MO to pairs of consonants.
However, there is another widespread equivalence, that between <U+0EAB
LAO LETTER HO SUNG, U+0EA5 LAO LETTER LO> and <U+0EAB, U+0EBC LAO
SEMIVOWEL SIGN LO>.  This would best be handled by treating SIGN LO as
having a compatibility decomposition to LETTER LO.
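
For searching, the effect can be approximated outside the collator.
The Python sketch below applies NFKD, which already decomposes U+0EDC
and U+0EDD, and then folds U+0EBC to U+0EA5 by hand, since Unicode
defines no such decomposition.  The function name is hypothetical, and
NFKD has other effects (for example on U+0EB3 LAO VOWEL SIGN AM) that a
real implementation would need to consider.

```python
import unicodedata

SIGN_LO = "\u0ebc"    # LAO SEMIVOWEL SIGN LO
LETTER_LO = "\u0ea5"  # the letter written 'LAO LETTER LO' in this message

def fold_ho_clusters(text):
    """Fold the HO clusters to a common form for matching."""
    # NFKD turns U+0EDC into <U+0EAB, U+0E99> and U+0EDD into <U+0EAB, U+0EA1>.
    text = unicodedata.normalize("NFKD", text)
    # Extra, non-standard step: treat SIGN LO as if it decomposed to LETTER LO.
    return text.replace(SIGN_LO, LETTER_LO)

print(fold_ho_clusters("\u0edc\u0eb2") == fold_ho_clusters("\u0eab\u0e99\u0eb2"))
print(fold_ho_clusters("\u0eab\u0ebc\u0eb2") == fold_ho_clusters("\u0eab\u0ea5\u0eb2"))
```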

Finally, people rarely appreciate the simplicity of Thai collation.
For Thai, the 'logical exception' vowels are simply swapped with the
immediately following consonant.  For Lao, the 'logical exception'
vowels are swapped with the immediately following consonant *cluster*.
It is particularly worth doing this for the 6 Lao letters that may
be composed of two characters:

1) HO NGO <U+0EAB, U+0E87 LAO LETTER NGO>
2) HO NYO <U+0EAB, U+0E8D LAO LETTER NYO>
3) HO NO  <U+0EAB, U+0E99 LAO LETTER NO> or <U+0EDC LAO HO NO>
4) HO MO  <U+0EAB, U+0EA1 LAO LETTER MO> or <U+0EDD LAO HO MO>
5) HO LO  <U+0EAB, U+0EA5 LAO LETTER LO> or <U+0EAB, U+0EBC LAO
SEMIVOWEL SIGN LO>
6) HO WO  <U+0EAB, U+0EA7 LAO LETTER WO>
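
A rough Python sketch of the difference between the two swaps.  It
treats only the six clusters listed above as clusters, and it makes no
attempt to tell a cluster from HO SUNG followed by a final consonant;
the names and the 'clusters' parameter are hypothetical.

```python
PREVOWELS = set("\u0ec0\u0ec1\u0ec2\u0ec3\u0ec4")  # the 'logical exception' vowels
HO_SUNG = "\u0eab"
# Second elements of the two-character clusters listed above.
CLUSTER_SECONDS = set("\u0e87\u0e8d\u0e99\u0ea1\u0ea5\u0ebc\u0ea7")

def is_consonant(ch):
    """Approximate test for a Lao consonant, including U+0EDC..U+0EDF."""
    return "\u0e81" <= ch <= "\u0eae" or "\u0edc" <= ch <= "\u0edf"

def reorder(text, clusters=True):
    """Move each prevowel after the following consonant (Thai style,
    clusters=False) or after the following consonant cluster (Lao style)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PREVOWELS and i + 1 < len(text) and is_consonant(text[i + 1]):
            j = i + 2
            if (clusters and text[i + 1] == HO_SUNG
                    and j < len(text) and text[j] in CLUSTER_SECONDS):
                j += 1  # take the second element of the cluster as well
            out.append(text[i + 1:j] + ch)
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)

word = "\u0ec0\u0eab\u0ea1"            # E, HO SUNG, MO
print(reorder(word, clusters=False))   # HO SUNG, E, MO
print(reorder(word))                   # HO SUNG, MO, E
```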

At present, DUCET gives the following odd ordering:

ຫງາ < ຫງຳ <A ຫງີ < ຫນາ <<< ໜາ < ເໜ < ຫມາ <<< ໝາ < ເໝ
< ຫລາ < ຫົດ <B ຫຼາ < ເຫງ < ເຫນ < ເຫມ < ເຫລ <C ເຫຼ < ໂຫດ

A: The vowels are in the wrong order!

B: ຫລາ and ຫຼາ should collate similarly (often, searching for
one ought to find the other) and ຫົດ should not come between them.

C: The nearest Lao has recently come to subscript final consonants is
U+0EBD LAO SEMIVOWEL SIGN NYO as a final consonant.

At least the tertiary differences are correct when they are assigned. 

Richard.