ignoring characters in collation (for Tibetan)

Richard Wordingham richard.wordingham at ntlworld.com
Mon Jun 8 09:05:42 CDT 2015


On Mon, 08 Jun 2015 12:54:26 +0200
Élie Roux <elie.roux at telecom-bretagne.eu> wrote:

> When sorting Tibetan, 0F35 and 0F37 should be completely ignored by
> the collation algorithm.
> 
> An example with rules for Dzongkha in CLDR:
> 
> - line 14 there is འདན<འདབ<འདམ
> - I want to sort འད༵བ, I want it to be equal weight to འདབ, as 0F37
> should be ignored
> - when sorting ད འདན འད༵བ འདམ ན འ ཡ (correct order) I get ད འདན འདམ ན
> འ འད༵བ ཡ (not correct)
> 
> so it seems འད༵བ is not treated as equal to འདབ. Is there any way to
> specify this with the current spec/implementation? If I have to
> duplicate all collation elements to give them a 0F35/0F37 variant, the
> table will just explode (it's already huge).

Unless you can use the prefix rule, you will just have to accept that
when characters are not nearly in the order wanted for collation, the
tables explode.  However, you may be able to reduce the number of
elements by assuming that words conform to grammar.  For example, where
in a word does Tibetan grammar allow these marks to go?
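
If the application controls what is handed to the collator, one possible
workaround outside the collation tables is to strip the two marks before
building the sort key.  The following is only a minimal sketch under
that assumption.  The names IGNORED_MARKS, strip_ignored_marks and
sort_key are made up for illustration, and base_key is a placeholder
that defaults to plain code point order; it is not the Dzongkha
tailoring, and a real collator's key function would have to be passed
in its place.

```python
# Hypothetical preprocessing sketch: drop U+0F35 and U+0F37 before
# collation so that a marked word gets the same key as the unmarked one.
IGNORED_MARKS = {0x0F35: None, 0x0F37: None}


def strip_ignored_marks(text: str) -> str:
    """Return text with U+0F35 and U+0F37 removed."""
    return text.translate(IGNORED_MARKS)


def sort_key(text: str, base_key=lambda s: s):
    """Build a sort key from the stripped text.

    base_key is an assumption: by default it is plain code point order.
    Pass the key function of a real collator to get language-specific
    ordering.
    """
    return base_key(strip_ignored_marks(text))


# U+0F60 U+0F51 followed by U+0F58, U+0F56 (with U+0F35 inserted), U+0F53.
words = [
    "\u0f60\u0f51\u0f58",
    "\u0f60\u0f51\u0f35\u0f56",
    "\u0f60\u0f51\u0f53",
]
result = sorted(words, key=sort_key)
```

With this key the marked word compares equal to its unmarked form, so
it sorts between the other two words instead of after them.  The cost
is that the marks can no longer act even as a final tie-breaker, which
may or may not matter for the data at hand.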

Richard.


