From elie.roux at telecom-bretagne.eu  Mon Jun  8 05:54:26 2015
From: elie.roux at telecom-bretagne.eu (=?UTF-8?B?w4lsaWUgUm91eA==?=)
Date: Mon, 08 Jun 2015 12:54:26 +0200
Subject: ignoring characters in collation (for Tibetan)
Message-ID: <55757462.4040401@telecom-bretagne.eu>

Dear all,

When sorting Tibetan, U+0F35 and U+0F37 should be completely ignored by
the collation algorithm.

An example with rules for Dzongkha in CLDR:

- on line 14 there is the rule ???<???<???
- I want to sort ????, and it should have the same weight as ???, since
0F37 should be ignored
- when sorting ? ??? ???? ??? ? ? ? (correct order) I get ? ??? ??? ? ?
???? ? (not correct)

so it seems ???? is not treated as equal to ???. Is there any way to
specify this with the current spec/implementation? If I have to
duplicate all collation elements to give them a 0F35/0F37 variant, the
table will just explode (it's already huge).
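
One workaround that sits outside the collation rules entirely is to remove
the two marks before a sort key is computed. The following is only a
minimal sketch under that assumption: the names strip_marks and
sort_ignoring_marks are hypothetical, and base_key stands in for whatever
real collation key function is in use (plain code point order is the
default here purely to keep the example self-contained; it is not
Dzongkha order).

```python
# Hypothetical preprocessing step: drop the two marks before collating.
# Mapping a code point to None makes str.translate delete it.
IGNORED_MARKS = {0x0F35: None, 0x0F37: None}  # U+0F35 and U+0F37


def strip_marks(text: str) -> str:
    """Return text with every U+0F35 and U+0F37 removed."""
    return text.translate(IGNORED_MARKS)


def sort_ignoring_marks(words, base_key=lambda s: s):
    """Sort words so that the two marks never influence the order.

    base_key is an assumption: a placeholder for a real collation key
    function. The default compares by code point only.
    """
    return sorted(words, key=lambda w: base_key(strip_marks(w)))


# Arbitrary Tibetan letters, written as escapes; not taken from the rules.
marked = "\u0F42\u0F37"
plain = "\u0F42"
print(strip_marks(marked) == plain)
```

Because the key is computed on the stripped string, a marked and an
unmarked spelling compare as equal, and Python's stable sort keeps them in
their input order. This does not answer whether the rule syntax itself can
express the same thing.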

Thank you very much,
-- 
Elie

From richard.wordingham at ntlworld.com  Mon Jun  8 09:05:42 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 8 Jun 2015 15:05:42 +0100
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <55757462.4040401@telecom-bretagne.eu>
References: <55757462.4040401@telecom-bretagne.eu>
Message-ID: <20150608150542.0f408177@JRWUBU2>

On Mon, 08 Jun 2015 12:54:26 +0200
Élie Roux <elie.roux at telecom-bretagne.eu> wrote:

> When sorting, Tibetan, 0F35 and 0F37 should be completely ignored by
> the collation algorithm.
> 
> An example with rules for Dzongkha in CLDR:
> 
> - line 14 there is ???<???<???
> - I want to sort ????, I want it to be equal weight to ???, as 0F37
> should be ignored
> - when sorting ? ??? ???? ??? ? ? ? (correct order) I get ? ??? ??? ?
> ? ???? ? (not correct)
> 
> so it seems ???? is not treated as equal to ???. Is there any way to
> specify this with the current spec/implementation? If I have to
> duplicate all collation elements to give them a 0F35/0F37 variant, the
> table will just explode (it's already huge).

Unless you can use the prefix rule, you will just have to accept the
problem that if characters aren't nearly in the order wanted for
collation, the tables just explode.  However, you may be able to reduce
the number of elements by assuming that words conform to grammar.  For
example, where in the word does Tibetan grammar allow these marks
to go?

Richard.


From elie.roux at telecom-bretagne.eu  Mon Jun  8 10:19:47 2015
From: elie.roux at telecom-bretagne.eu (=?UTF-8?B?w4lsaWUgUm91eA==?=)
Date: Mon, 08 Jun 2015 17:19:47 +0200
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <20150608150542.0f408177@JRWUBU2>
References: <55757462.4040401@telecom-bretagne.eu>
 <20150608150542.0f408177@JRWUBU2>
Message-ID: <5575B293.60207@telecom-bretagne.eu>

> Unless you can use the prefix rule,

I admit it's very difficult for me to understand the prefix rule, but
I'm quite sure it can't be applied here...

> you will just have to accept the
> problem that if characters aren't nearly in the order wanted for
> collation, the tables just explode.  

I'll add the 320 new elements to the table then :)

> However, you may be able to reduce
> the number of elements by assuming that words conform to grammar. For
> example, where in the word does Tibetan grammar allow these marks
> to go?

Under the main letter (or stack) of any syllable, these are markers used
(among other things) in the long life prayers to "underline" the name of
the person it is dedicated to, and names can contain any syllable. So
it's kind of like putting something in italic or bold.

Thank you,
-- 
Elie

From richard.wordingham at ntlworld.com  Tue Jun  9 17:43:28 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 9 Jun 2015 23:43:28 +0100
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <5575B293.60207@telecom-bretagne.eu>
References: <55757462.4040401@telecom-bretagne.eu>
 <20150608150542.0f408177@JRWUBU2>
 <5575B293.60207@telecom-bretagne.eu>
Message-ID: <20150609234328.5ec74b46@JRWUBU2>

On Mon, 08 Jun 2015 17:19:47 +0200
Élie Roux <elie.roux at telecom-bretagne.eu> wrote:

> > Unless you can use the prefix rule,
> 
> I admit it's very difficult for me to understand the prefix rule, but
> I'm quite sure it can't be applied here...

You may be able to use the 'underlining markers' to recognise that a
consonant is a post-consonant, and build up the weights as weights for
the syllable up to the marker followed by weights for the rest of the
syllable, so having M+N collation entries rather than M×N.  
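
To make the difference in table size concrete, here is a small counting
sketch. The inventories are hypothetical placeholders (30 "heads" and 10
"tails" are made-up numbers, not Tibetan data); the point is only how the
two approaches scale.

```python
from itertools import product

# Hypothetical placeholder inventories, not real Tibetan data.
heads = [f"head{i}" for i in range(30)]  # M = 30 parts up to the marker
tails = [f"tail{j}" for j in range(10)]  # N = 10 parts after the marker

# One entry per whole syllable: every head combined with every tail.
combined_entries = [h + t for h, t in product(heads, tails)]

# Separate entries for heads and tails, relying on the marker to split them.
split_entries = heads + tails

print(len(combined_entries))  # 300, i.e. M * N
print(len(split_entries))     # 40, i.e. M + N
```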

> > you will just have to accept the
> > problem that if characters aren't nearly in the order wanted for
> > collation, the tables just explode.  
> 
> I'll add the 320 new elements to the table then :)

I experimented with Lao collation for a relatively computer-friendly
collation, one based on CVCT - sort syllable by syllable and
then sort syllables by initial, then by vowel, then by final consonant,
and finally by tone.  I was testing it against a dictionary, but
then discovered that although the dictionary generally sorted initial
syllables correctly, it tended to sort subsequent syllables by Thai
rules. Because neither the UCA nor the CLDR Collation Algorithm has
any accommodation for sorting syllable by syllable, tones have primary
weights when it comes to multi-syllable items.  The commoner Lao
sorting system is based on CCVT, which requires even larger tables - I
would not only have logical order exception (should be 'collation order
exception') vowels and tones, but logical order exception consonants. I
found myself generating tens of thousands of collating elements for
the CVCT system. If I don't avail myself of the fact that only a few
consonants can be final consonants, I generate over 180,000 collating
elements. The problems are:

1) Vowels are written with multiple characters.  This exacerbates the
following problems.
2) The first vowel symbol may precede the initial consonant.  It is not
enough to use a contraction to swap vowel and consonant as in Thai - the
order is also affected by the following vowel characters. 
3) The tone character is stored amongst the vowel characters.
4) Most final consonant characters can be initial consonants.  Initial
and final consonants order differently.  Usually the only way to tell a
final consonant from an initial consonant is that an initial consonant
has a vowel or tone mark next to it.  The CLDR CA does not have suffix
rules.
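
For comparison, the CVCT ordering itself is simple once the syllables have
already been segmented, which is exactly the step the rule-based approach
cannot do. The sketch below assumes that segmentation has been done
elsewhere; Syllable and cvct_key are hypothetical names, and the Latin
letters and digits are placeholders compared by code point, not Lao data.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    """A pre-segmented syllable (assumed input; segmentation not shown)."""
    initial: str  # initial consonant
    vowel: str    # vowel, however many characters it is written with
    final: str    # final consonant, "" if none
    tone: str     # tone mark, "" if none


def cvct_key(word):
    """Key for a word given as a list of Syllable values.

    Syllables are compared one after another; within a syllable the
    order is initial, then vowel, then final consonant, then tone.
    """
    return [(s.initial, s.vowel, s.final, s.tone) for s in word]


words = [
    [Syllable("k", "a", "m", "1")],
    [Syllable("k", "a", "", "2")],
    [Syllable("b", "o", "", "1"), Syllable("k", "a", "", "1")],
]
ordered = sorted(words, key=cvct_key)
print([[s.initial + s.vowel + s.final + s.tone for s in w] for w in ordered])
```

In a real implementation each component would be mapped to a rank in the
desired alphabet rather than compared as a raw string.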

Richard.


From elie.roux at telecom-bretagne.eu  Wed Jun 10 00:13:17 2015
From: elie.roux at telecom-bretagne.eu (=?UTF-8?B?w4lsaWUgUm91eA==?=)
Date: Wed, 10 Jun 2015 07:13:17 +0200
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <20150609234328.5ec74b46@JRWUBU2>
References: <55757462.4040401@telecom-bretagne.eu>
 <20150608150542.0f408177@JRWUBU2> <5575B293.60207@telecom-bretagne.eu>
 <20150609234328.5ec74b46@JRWUBU2>
Message-ID: <5577C76D.7080302@telecom-bretagne.eu>

> You may be able to use the 'underlining markers' to recognise that a
> consonant is a post-consonant, and build up the weights as weights for
> the syllable up to the marker followed by weights for the rest of the
> syllable, so having M+N collation entries rather than M×N.  

I'm not sure it's possible here, let me try to explain why... Let's take
the example of ???, which should have the same weight as ????. It sorts
under the letter ?, as ? is a prefix here. The problem is that "???"
should sort under the letter ?, as in this case ? is a suffix. So the
rule for ? reads something like

&?<???<???

I tried to do something like

&?<???<???<???

but with it I now have

???? < ???

while it should be the other way around... So the only way to fix it
would be

&?<???=????<???=????

I don't think I can use any prefix here?

> I experimented with Lao collation for a relatively computer-friendly
> collation, one based on CVCT

What is CVCT?

> - sort syllable by syllable and
> then sort syllables by initial, then by vowel, then by final consonant,
> and finally by tone.  [...]
> 4) Most final consonant characters can be initial consonants.  Initial
> and final consonants order differently.  Usually the only way to tell a
> final consonant from an initial consonant is that an initial consonant
> has a vowel or tone mark next to it.  The CLDR CA does not have suffix
> rules.

Well, it seems Tibetan is not that hard to sort after all! :)

Thank you!
-- 
Elie