Composition / Decomposition of Tibetan oM (0F00)

Richard Wordingham via Indic indic at unicode.org
Sun Mar 18 15:51:26 CDT 2018


On Sun, 18 Mar 2018 11:23:16 +0100
Élie Roux via Indic <indic at unicode.org> wrote:

> Dear All,
> 
> I am wondering why U+0F00 is not indicated as being composed of
> 
> U+0F68 U+0F7C U+0F7E
> 
> which is what a native person would think? Is there supposed to be a
> semantic difference between the two (U+0F00 and this decomposition)?

Sacred syllable vs. run-of-the-mill syllable?

> When I see something in a manuscript, how can I know if I should input
> U+0F00 or the decomposition?

I would expect that U+0F00 would have a wider range of glyphs - which
wouldn't help when there is nothing special about the glyph before you.

> My experience is that different input systems will produce one or the
> other so when I'm working on a Tibetan corpus I have to normalize them
> to run some analysis. It seems the normalization I perform
> (decomposing U+0F00) should be part of NFD... why isn't it?

The possibly unhelpful answer is that you should use a collation-like
folding for your searches that can then ignore differences that don't
interest you.  For example, under the UCA default collation, U+0F00 and
<U+0F68, U+0F7C, U+0F7E> are no more different than upper and lower
case in English.  The fact that the UCA default collation gets Tibetan
ordering badly wrong is irrelevant for this application.
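
As a minimal sketch of such a folding, assuming Python and its standard
unicodedata module: the function name fold_tibetan and the one-entry
mapping table are illustrative assumptions, not part of any Unicode
normalization form or of the UCA. It applies NFD, which leaves U+0F00
untouched, and then substitutes the spelled-out sequence so that both
encodings compare equal for search or analysis.

```python
import unicodedata

# Precomposed TIBETAN SYLLABLE OM and the spelled-out sequence from the thread.
OM = "\u0F00"
OM_SPELLED_OUT = "\u0F68\u0F7C\u0F7E"

# Hypothetical application-level folding table; extend it with whatever other
# distinctions your analysis should ignore.
_FOLD_TABLE = str.maketrans({OM: OM_SPELLED_OUT})


def fold_tibetan(text: str) -> str:
    """Return a search key: NFD first, then the custom folding of U+0F00."""
    # NFD does not decompose U+0F00, so the extra translate step is needed.
    return unicodedata.normalize("NFD", text).translate(_FOLD_TABLE)


if __name__ == "__main__":
    a = fold_tibetan(OM)
    b = fold_tibetan(OM_SPELLED_OUT)
    print(a == b)
```

The folded string is only meant as a comparison key; the original text
should be kept as it was entered, since the folding discards whatever
distinction U+0F00 was meant to carry.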

Richard.


