Deterministic sorting impossible for Tibetan with current state

Richard Wordingham richard.wordingham at ntlworld.com
Tue May 12 13:24:11 CDT 2015


On Tue, 12 May 2015 15:28:04 +0200
Élie Roux <elie.roux at telecom-bretagne.eu> wrote:

> I'm currently working on Tibetan sorting. It mostly works, except for
> this case:
> 
> མངས་
> 
> This Unicode sequence can be interpreted in two very different ways,
> both valid in the Tibetan language:
> 
> - prefix མ, main letter ང, suffix ས
> - main letter མ, suffix ང, second suffix ས
> 
> Both have their entries in a Tibetan dictionary: one in the entries
> for letter མ, another (with a different meaning) in the entries for
> letter ང.
> 
> It is thus currently impossible to determine the place of the string
> "མངས་" in a dictionary (Tibetans guess from the context).
> 
> Are there other languages where this indeterminacy happens?

Certain examples are rare; it's been claimed that there are none in
Tibetan.  Welsh has this problem, but the closest pair I could come up
with is _englyna_ 'to compose', which sorts between eg- and eh-, versus
_engrafu_ 'to engrave', which sorts between enf- and enh-.

> Did they solve that problem?

Where one reading is a digraph, as with the Welsh letter 'ng', which
comes between 'g' and 'h', the Unicode Collation Algorithm recommends
inserting U+034F COMBINING GRAPHEME JOINER (CGJ) between the two
characters when they are not the digraph.  Soft hyphen will often do as
well, as in the Welsh place name Llangollen, which does not include the
letter 'ng'.
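
To make the mechanism concrete, here is a minimal sketch of a toy Welsh
collator in Python.  It is not the UCA and not any library's API: the
names WELSH, WEIGHT and sort_key are invented for illustration, it
handles lower-case letters only, and it has a single weight level.  It
only shows how a CGJ between 'n' and 'g' stops the pair from matching
the digraph 'ng' while contributing nothing to the key itself.

```python
CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Welsh alphabet in dictionary order; digraphs are single letters.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
         "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
         "t", "th", "u", "w", "y"]
WEIGHT = {letter: rank for rank, letter in enumerate(WELSH)}


def sort_key(word):
    """Toy single-level key: longest match first, CGJ ignored."""
    key = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            # CGJ has no weight of its own, but because it sits between
            # the two characters it has already prevented a digraph match.
            i += 1
            continue
        pair = word[i:i + 2]
        if pair in WEIGHT:
            key.append(WEIGHT[pair])     # digraph such as 'ng' or 'll'
            i += 2
        else:
            key.append(WEIGHT[word[i]])  # plain single letter
            i += 1
    return key


words = ["enillydd", "en" + CGJ + "grafu", "ehangu", "englyna", "enfys", "egni"]
ordered = sorted(words, key=sort_key)
# 'englyna' lands between 'egni' and 'ehangu'; the CGJ-marked 'engrafu'
# lands between 'enfys' and 'enillydd'.
```

Without the CGJ, this toy would read the 'ng' of _engrafu_ as the
digraph and file the word next to _englyna_.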

So for your example, I would suggest that, since <U+0F58 TIBETAN LETTER
MA, U+0F44 TIBETAN LETTER NGA> would be a collating element in a lean
Tibetan collation table, you write _mangs_ as <U+0F58, U+034F, U+0F44,
U+0F66 TIBETAN LETTER SA> and reserve <U+0F58, U+0F44, U+0F66> for
_mngas_.
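
As a rough illustration of the two spellings, the following Python
sketch builds both strings and files each under a head letter.  The
CONTRACTIONS table and the head_letter function are hypothetical
stand-ins for a real collation table; the only rule assumed is that the
contiguous pair <U+0F58, U+0F44> is filed under NGA.

```python
MA, NGA, SA = "\u0f58", "\u0f44", "\u0f66"  # TIBETAN LETTER MA, NGA, SA
CGJ = "\u034f"                              # COMBINING GRAPHEME JOINER

mangs = MA + CGJ + NGA + SA  # main letter MA, suffixes NGA and SA
mngas = MA + NGA + SA        # prefix MA, main letter NGA, suffix SA

# Hypothetical one-entry table: prefix MA + main letter NGA files under NGA.
CONTRACTIONS = {MA + NGA: NGA}


def head_letter(word):
    """Return the letter under which this toy table would file the word."""
    for sequence, head in CONTRACTIONS.items():
        if word.startswith(sequence):
            return head
    # No contraction matched (for example because a CGJ separates the
    # two letters), so the first character is the head letter.
    return word[0]


print(head_letter(mangs) == MA)   # filed under MA
print(head_letter(mngas) == NGA)  # filed under NGA
```

The two strings differ only by the invisible CGJ, so they should look
the same when rendered, but a table with this contraction can tell them
apart.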

Richard.


