From elie.roux at telecom-bretagne.eu Mon Jun 8 05:54:26 2015
From: elie.roux at telecom-bretagne.eu (Élie Roux)
Date: Mon, 08 Jun 2015 12:54:26 +0200
Subject: ignoring characters in collation (for Tibetan)
Message-ID: <55757462.4040401@telecom-bretagne.eu>

Dear all,

When sorting Tibetan, 0F35 and 0F37 should be completely ignored by the collation algorithm.

An example with rules for Dzongkha in CLDR:

- line 14 there is ???

References: <55757462.4040401@telecom-bretagne.eu>
Message-ID: <20150608150542.0f408177@JRWUBU2>

On Mon, 08 Jun 2015 12:54:26 +0200
Élie Roux wrote:

> When sorting Tibetan, 0F35 and 0F37 should be completely ignored by
> the collation algorithm.
>
> An example with rules for Dzongkha in CLDR:
>
> - line 14 there is ???
> - I want to sort ????, I want it to be equal weight to ???, as 0F37
> should be ignored
> - when sorting ? ??? ???? ??? ? ? ? (correct order) I get ? ??? ??? ?
> ? ???? ? (not correct)
>
> so it seems ???? is not treated as equal to ???. Is there any way to
> specify this with the current spec/implementation? If I have to
> duplicate all collation elements to give them a 0F35/0F37 variant, the
> table will just explode (it's already huge).

Unless you can use the prefix rule, you will just have to accept the problem that if characters aren't nearly in the order wanted for collation, the tables just explode. However, you may be able to reduce the number of elements by assuming that words conform to grammar. For example, where in the word does Tibetan grammar allow these marks to go?

Richard.
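As a rough illustration of the behaviour requested in this exchange (and not a CLDR tailoring), one workaround outside the collation tables is to strip U+0F35 and U+0F37 from each string before computing its sort key. The sketch below assumes Python; `strip_marks` and `sort_key` are hypothetical names, and plain code-point comparison stands in for a real Tibetan collation key.

```python
# Hypothetical preprocessing sketch: remove the two marks before comparison
# so that a marked syllable and its unmarked form get the same key.
# Mapping a code point to None in str.translate deletes that character.
IGNORED_MARKS = {0x0F35: None, 0x0F37: None}


def strip_marks(text: str) -> str:
    """Return text with every U+0F35 and U+0F37 removed."""
    return text.translate(IGNORED_MARKS)


def sort_key(text: str) -> str:
    # Assumption: a real implementation would hand the stripped string to a
    # proper collator; here the stripped string itself is used as the key.
    return strip_marks(text)


# Example input written with escapes; the second and third items differ
# only by the presence of U+0F37.
words = ["\u0F42", "\u0F56\u0F66\u0F37", "\u0F56\u0F66"]
ordered = sorted(words, key=sort_key)
```

Because the marks are removed before any key is built, no extra collation elements are needed for the marked variants; the trade-off is that marked and unmarked forms become indistinguishable at every level of comparison.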
From elie.roux at telecom-bretagne.eu Mon Jun 8 10:19:47 2015
From: elie.roux at telecom-bretagne.eu (Élie Roux)
Date: Mon, 08 Jun 2015 17:19:47 +0200
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <20150608150542.0f408177@JRWUBU2>
References: <55757462.4040401@telecom-bretagne.eu> <20150608150542.0f408177@JRWUBU2>
Message-ID: <5575B293.60207@telecom-bretagne.eu>

> Unless you can use the prefix rule,

I admit it's very difficult for me to understand the prefix rule, but I'm quite sure it can't be applied here...

> you will just have to accept the
> problem that if characters aren't nearly in the order wanted for
> collation, the tables just explode.

I'll add the 320 new elements to the table then :)

> However, you may be able to reduce
> the number of elements by assuming that words conform to grammar. For
> example, where in the word does Tibetan grammar allow these marks
> to go?

Under the main letter (or stack) of any syllable. These are markers used (among other things) in long-life prayers to "underline" the name of the person the prayer is dedicated to, and names can contain any syllable. So it's kind of like putting something in italics or bold.

Thank you,
--
Elie

From richard.wordingham at ntlworld.com Tue Jun 9 17:43:28 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 9 Jun 2015 23:43:28 +0100
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <5575B293.60207@telecom-bretagne.eu>
References: <55757462.4040401@telecom-bretagne.eu> <20150608150542.0f408177@JRWUBU2> <5575B293.60207@telecom-bretagne.eu>
Message-ID: <20150609234328.5ec74b46@JRWUBU2>

On Mon, 08 Jun 2015 17:19:47 +0200
Élie Roux wrote:

> > Unless you can use the prefix rule,
>
> I admit it's very difficult for me to understand the prefix rule, but
> I'm quite sure it can't be applied here...

You may be able to use the 'underlining markers' to recognise that a consonant is a post-consonant, and build up the weights as weights for the syllable up to the marker followed by weights for the rest of the syllable, so having M+N collation entries rather than M×N.

> > you will just have to accept the
> > problem that if characters aren't nearly in the order wanted for
> > collation, the tables just explode.
>
> I'll add the 320 new elements to the table then :)

I experimented with Lao collation for a relatively computer-friendly collation, one based on CVCT - sort syllable by syllable and then sort syllables by initial, then by vowel, then by final consonant, and finally by tone. I was testing it against a dictionary, but then discovered that although the dictionary generally sorted initial syllables correctly, it tended to sort subsequent syllables by Thai rules. Because neither the UCA nor the CLDR Collation Algorithm has any accommodation for sorting syllable by syllable, tones have primary weights when it comes to multi-syllable items. The commoner Lao sorting system is based on CCVT, which requires even larger tables - I would not only have logical order exception (should be 'collation order exception') vowels and tones, but logical order exception consonants.

I found myself generating tens of thousands of collating elements for the CVCT system. If I don't avail myself of the fact that only a few consonants can be final consonants, I generate over 180,000 collating elements. The problems are:

1) Vowels are written with multiple characters. This exacerbates the following problems.

2) The first vowel symbol may precede the initial consonant. It is not enough to use a contraction to swap vowel and consonant as in Thai - the order is also affected by the following vowel characters.

3) The tone character is stored amongst the vowel characters.

4) Most final consonant characters can be initial consonants. Initial and final consonants order differently.
Usually the only way to tell a final consonant from an initial consonant is that an initial consonant has a vowel or tone mark next to it. The CLDR CA does not have suffix rules.

Richard.

From elie.roux at telecom-bretagne.eu Wed Jun 10 00:13:17 2015
From: elie.roux at telecom-bretagne.eu (Élie Roux)
Date: Wed, 10 Jun 2015 07:13:17 +0200
Subject: ignoring characters in collation (for Tibetan)
In-Reply-To: <20150609234328.5ec74b46@JRWUBU2>
References: <55757462.4040401@telecom-bretagne.eu> <20150608150542.0f408177@JRWUBU2> <5575B293.60207@telecom-bretagne.eu> <20150609234328.5ec74b46@JRWUBU2>
Message-ID: <5577C76D.7080302@telecom-bretagne.eu>

> You may be able to use the 'underlining markers' to recognise that a
> consonant is a post-consonant, and build up the weights as weights for
> the syllable up to the marker followed by weights for the rest of the
> syllable, so having M+N collation entries rather than M×N.

I'm not sure it's possible here, let me try to explain why...

Let's take the example of ???, which should have the same weight as ????. It sorts under the letter ?, as ? is a prefix here. The problem is that "???" should sort under the letter ?, as in this case ? is a suffix. So the rule for ? reads something like &?

> I experimented with Lao collation for a relatively computer-friendly
> collation, one based on CVCT

What is CVCT?

> - sort syllable by syllable and
> then sort syllables by initial, then by vowel, then by final consonant,
> and finally by tone.
[...]
> 4) Most final consonant characters can be initial consonants. Initial
> and final consonants order differently. Usually the only way to tell a
> final consonant from an initial consonant is that an initial consonant
> has a vowel or tone mark next to it. The CLDR CA does not have suffix
> rules.

Well, it seems Tibetan is not that hard to sort after all! :)

Thank you!
--
Elie
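To make the CVCT ordering described in the thread concrete (compare syllable by syllable, and within a syllable by initial consonant, then vowel, then final consonant, then tone), here is a minimal Python sketch. It assumes the syllables have already been segmented into those four fields, which is exactly the part the thread identifies as hard to express in collation tables. The `Syllable` type, the `cvct_key` function and the Latin placeholder values are all hypothetical, and plain string comparison stands in for real collation weights.

```python
from typing import List, NamedTuple, Tuple


class Syllable(NamedTuple):
    """A pre-segmented syllable; all field values here are placeholders."""
    initial: str  # initial consonant
    vowel: str    # vowel, however many characters it is written with
    final: str    # final consonant, "" if there is none
    tone: str     # tone mark, "" if there is none


def cvct_key(word: List[Syllable]) -> List[Tuple[str, str, str, str]]:
    # One tuple per syllable, so an earlier syllable always outranks a
    # later one, and within a syllable the tone is compared last.
    return [(s.initial, s.vowel, s.final, s.tone) for s in word]


words = [
    [Syllable("k", "a", "n", "")],
    [Syllable("k", "a", "", "1")],
    [Syllable("k", "a", "", "")],
    [Syllable("b", "i", "", ""), Syllable("k", "a", "", "")],
]
ordered = sorted(words, key=cvct_key)
```

In this form the tone only breaks ties inside its own syllable. A flat weight string, as produced by table-driven collation, cannot express that directly, which is the reason given in the thread for tones ending up with primary weights in multi-syllable items.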