ignoring characters in collation (for Tibetan)

Élie Roux elie.roux at telecom-bretagne.eu
Wed Jun 10 00:13:17 CDT 2015


> You may be able to use the 'underlining markers' to recognise that a
> consonant is a post-consonant, and build up the weights as weights for
> the syllable up to the marker followed by weights for the rest of the
> syllable, so having M+N collation entries rather than M×N.  

I'm not sure that's possible here; let me try to explain why. Take the
example of དགན, which should have the same weight as དག༵ན. It sorts
under the letter ག, because ད is a prefix here. The problem is that "དག༵"
should sort under the letter ད, because in that case ག is a suffix. So
the rule for ག reads something like

&ག<དགན<དགལ

I tried to do something like

&ག<དག༵<དགན<དགལ

but with it I now have

དག༵ལ < དགན

while it should be the other way around... So the only way to fix it
would be

&ག<དགན=དག༵ན<དགལ=དག༵ལ
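Writing every entry twice by hand doubles the size of the rule, so it could be generated instead. What follows is only a minimal sketch: `with_mark` and `build_rule` are hypothetical helper names, the marker is assumed to be U+0F35, and it is assumed to always sit right after the second character, as in the example above.

```python
MARK = "\u0f35"  # the underlining marker, assumed here to be U+0F35


def with_mark(syllable, pos):
    """Return the syllable with the marker inserted after `pos` characters."""
    return syllable[:pos] + MARK + syllable[pos:]


def build_rule(reset, syllables, pos=2):
    """Build a '&reset<a=a_marked<b=b_marked' tailoring string.

    Each syllable is listed once without and once with the marker,
    joined by '=' so that both spellings get the same weight.
    """
    entries = [s + "=" + with_mark(s, pos) for s in syllables]
    return "&" + reset + "<" + "<".join(entries)


# Reset on ག, then the two example syllables (escapes spell དགན and དགལ).
rule = build_rule("\u0f42", ["\u0f51\u0f42\u0f53", "\u0f51\u0f42\u0f63"])
print(rule)
```

The output is a plain string in the same shape as the rule above; whether a real collator accepts a rule of that size is a separate question.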

I don't think I can use any prefix here?
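
To make the difference between the two rules concrete, here is a toy model of what seems to happen. It is not ICU or CLDR: it assumes a single weight level, longest-match contractions, and puts any unmatched character after all tailored entries. `parse_rule` and `sort_key` are hypothetical names, and the marker is assumed to be U+0F35.

```python
DA, GA, NA, LA = "\u0f51", "\u0f42", "\u0f53", "\u0f63"  # ད ག ན ལ
MARK = "\u0f35"  # the underlining marker, assumed to be U+0F35


def parse_rule(rule):
    """Parse a toy '&reset<a=b<c' rule into {string: rank}.

    Only '<' (next weight) and '=' (same weight) are understood.
    """
    if not rule.startswith("&"):
        raise ValueError("rule must start with '&'")
    ranks, rank, token, op = {}, 0, "", None
    for ch in rule[1:] + "<":  # trailing '<' flushes the last token
        if ch in "<=":
            if op == "<":
                rank += 1
            ranks[token] = rank
            op, token = ch, ""
        else:
            token += ch
    return ranks


def sort_key(text, ranks):
    """Greedy longest-match key; unmatched characters sort last."""
    longest = max(len(s) for s in ranks)
    key, i = [], 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in ranks:
                key.append((0, ranks[chunk]))
                i += n
                break
        else:  # no tailored entry starts here
            key.append((1, ord(text[i])))
            i += 1
    return key


# Rule with the bare marked contraction, then the rule with '=' pairs.
first = parse_rule(f"&{GA}<{DA}{GA}{MARK}<{DA}{GA}{NA}<{DA}{GA}{LA}")
fixed = parse_rule(
    f"&{GA}<{DA}{GA}{NA}={DA}{GA}{MARK}{NA}<{DA}{GA}{LA}={DA}{GA}{MARK}{LA}"
)

# First rule: དག༵ལ is split into དག༵ + ལ, so it lands before དགན.
print(sort_key(DA + GA + MARK + LA, first) < sort_key(DA + GA + NA, first))
# Fixed rule: དགན and དག༵ན get the same key, and both precede དག༵ལ.
print(sort_key(DA + GA + NA, fixed) == sort_key(DA + GA + MARK + NA, fixed))
print(sort_key(DA + GA + NA, fixed) < sort_key(DA + GA + MARK + LA, fixed))
```

In this model the marked contraction swallows the first three characters of དག༵ལ, which is why it ends up ahead of དགན unless every marked syllable is listed in full.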

> I experimented with Lao collation for a relatively computer-friendly
> collation, one based on CVCT

What is CVCT?

> - sort syllable by syllable and
> then sort syllables by initial, then by vowel, then by final consonant,
> and finally by tone.  [...]
> 4) Most final consonant characters can be initial consonants.  Initial
> and final consonants order differently.  Usually the only way to tell a
> final consonant from an initial consonant is that an initial consonant
> has a vowel or tone mark next to it.  The CLDR CA does not have suffix
> rules.

Well, it seems Tibetan is not that hard to sort after all! :)

Thank you!
-- 
Elie

