ignoring characters in collation (for Tibetan)

Richard Wordingham richard.wordingham at ntlworld.com
Tue Jun 9 17:43:28 CDT 2015


On Mon, 08 Jun 2015 17:19:47 +0200
Élie Roux <elie.roux at telecom-bretagne.eu> wrote:

> > Unless you can use the prefix rule,
> 
> I admit it's very difficult for me to understand the prefix rule, but
> I'm quite sure it can't be applied here...

You may be able to use the 'underlining markers' to recognise that a
consonant is a post-consonant, and then build the weights for a syllable
as the weights for the part up to the marker followed by the weights for
the rest of the syllable.  That way you need M+N collation entries
rather than M×N.
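
To make the M+N versus M×N point concrete, here is a minimal Python
sketch.  It is only a toy model: the inventories HEADS and TAILS and all
function names are hypothetical placeholders, not Tibetan data and not
part of any collation library.  It assumes each syllable has already
been split at the marker into a (head, tail) pair.

```python
from itertools import product

# Hypothetical placeholder inventories, not real Tibetan data.
HEADS = ["H1", "H2", "H3"]          # M = 3 syllable parts up to the marker
TAILS = ["T1", "T2", "T3", "T4"]    # N = 4 syllable parts after the marker


def whole_syllable_table(heads, tails):
    """One entry per complete syllable: M * N entries."""
    return {
        (h, t): (hi, ti)
        for (hi, h), (ti, t) in product(enumerate(heads), enumerate(tails))
    }


def split_tables(heads, tails):
    """One entry per part: M + N entries in total."""
    head_weights = {h: i for i, h in enumerate(heads)}
    tail_weights = {t: i for i, t in enumerate(tails)}
    return head_weights, tail_weights


def split_key(syllable, head_weights, tail_weights):
    """Build a syllable's weights as head weights followed by tail weights."""
    head, tail = syllable
    return (head_weights[head], tail_weights[tail])


whole = whole_syllable_table(HEADS, TAILS)
head_w, tail_w = split_tables(HEADS, TAILS)
print(len(whole), len(head_w) + len(tail_w))
```

Both approaches give every syllable the same weights in this model; the
difference is only in how many entries have to be written down.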

> > you will just have to accept the
> > problem that if characters aren't nearly in the order wanted for
> > collation, the tables just explode.  
> 
> I'll add the 320 new elements to the table then :)

I experimented with a relatively computer-friendly collation for Lao,
one based on CVCT: sort syllable by syllable, ordering syllables by
initial consonant, then by vowel, then by final consonant, and finally
by tone.  I was testing it against a dictionary, but
then discovered that although the dictionary generally sorted initial
syllables correctly, it tended to sort subsequent syllables by Thai
rules. Because neither the UCA nor the CLDR Collation Algorithm has
any accommodation for sorting syllable by syllable, tones have primary
weights when it comes to multi-syllable items.  The commoner Lao
sorting system is based on CCVT, which requires even larger tables: I
would have not only logical order exception (it should be 'collation
order exception') vowels and tones, but also logical order exception
consonants.  I
found myself generating tens of thousands of collating elements for
the CVCT system. If I don't avail myself of the fact that only a few
consonants can be final consonants, I generate over 180,000 collating
elements. The problems are:

1) Vowels are written with multiple characters.  This exacerbates the
following problems.
2) The first vowel symbol may precede the initial consonant.  It is not
enough to use a contraction to swap vowel and consonant as in Thai - the
order is also affected by the following vowel characters. 
3) The tone character is stored amongst the vowel characters.
4) Most final consonant characters can be initial consonants.  Initial
and final consonants order differently.  Usually the only way to tell a
final consonant from an initial consonant is that an initial consonant
has a vowel or tone mark next to it.  The CLDR CA does not have suffix
rules.
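
The CVCT ordering and the remark about tones needing primary weights
can be illustrated with a small Python sketch.  It is a simplified model
under stated assumptions, not the UCA or the CLDR Collation Algorithm:
syllables are assumed to be already parsed into (initial, vowel, final,
tone), which sidesteps problems 1 to 4 entirely, and the symbol
inventories and function names are hypothetical placeholders rather
than real Lao data.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    initial: str
    vowel: str
    final: str  # "" when there is no final consonant
    tone: str   # "" when there is no tone mark


# Hypothetical orderings over placeholder symbols, not real Lao data.
INITIALS = ["k", "kh", "ng"]
VOWELS = ["a", "i", "u"]
FINALS = ["", "k", "ng"]
TONES = ["", "1", "2"]


def syllable_key(s):
    """CVCT: initial, then vowel, then final consonant, then tone."""
    return (
        INITIALS.index(s.initial),
        VOWELS.index(s.vowel),
        FINALS.index(s.final),
        TONES.index(s.tone),
    )


def word_key(word):
    """Syllable by syllable: the tone is settled inside each syllable."""
    return tuple(syllable_key(s) for s in word)


def flat_key(word):
    """Toy two-level model: all consonants and vowels first, tones last."""
    base = tuple(syllable_key(s)[:3] for s in word)
    tones = tuple(syllable_key(s)[3] for s in word)
    return (base, tones)


a = (Syllable("k", "a", "", "2"), Syllable("k", "a", "", ""))
b = (Syllable("k", "a", "", ""), Syllable("kh", "a", "", ""))
print(sorted([a, b], key=word_key) == [b, a])
print(sorted([a, b], key=flat_key) == [a, b])
```

In this model the two keys disagree on the pair a, b: comparing
syllable by syllable lets the tone of the first syllable decide before
the second syllable is looked at, whereas deferring tones to a later
level lets the second syllable's initial decide first.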

Richard.


