From indic at unicode.org Sun Mar 18 05:23:16 2018
From: indic at unicode.org (Élie Roux via Indic)
Date: Sun, 18 Mar 2018 11:23:16 +0100
Subject: Composition / Decomposition of Tibetan oM (0F00)
Message-ID: <71c75887-ce61-2cf6-a7cc-c750defd2713@telecom-bretagne.eu>

Dear All,

I am wondering why U+0F00 is not indicated as being composed of

U+0F68 U+0F7C U+0F7E

which is what a native person would think? Is there supposed to be a
semantic difference between the two (U+0F00 and this decomposition)?

When I see something in a manuscript, how can I know if I should input
U+0F00 or the decomposition?

My experience is that different input systems will produce one or the
other so when I'm working on a Tibetan corpus I have to normalize them
to run some analysis. It seems the normalization I perform
(decomposing U+0F00) should be part of NFD... why isn't it?

The same question holds for the (less common)

U+0F02 = U+0F60 U+0F74 U+0F82 U+0F7F
U+0F03 = U+0F60 U+0F74 U+0F82 U+0F14

Thank you,
-- 
Elie

From indic at unicode.org Sun Mar 18 15:51:26 2018
From: indic at unicode.org (Richard Wordingham via Indic)
Date: Sun, 18 Mar 2018 20:51:26 +0000
Subject: Composition / Decomposition of Tibetan oM (0F00)
In-Reply-To: <71c75887-ce61-2cf6-a7cc-c750defd2713@telecom-bretagne.eu>
References: <71c75887-ce61-2cf6-a7cc-c750defd2713@telecom-bretagne.eu>
Message-ID: <20180318205126.383e1f01@JRWUBU2>

On Sun, 18 Mar 2018 11:23:16 +0100
Élie Roux via Indic wrote:

> Dear All,
>
> I am wondering why U+0F00 is not indicated as being composed of
>
> U+0F68 U+0F7C U+0F7E
>
> which is what a native person would think? Is there supposed to be a
> semantic difference between the two (U+0F00 and this decomposition)?

Sacred syllable v. run of the mill syllable?

> When I see something in a manuscript, how can I know if I should input
> U+0F00 or the decomposition?

I would expect that U+0F00 would have a wider range of glyphs - which
wouldn't help when there is nothing special about the glyph before you.

> My experience is that different input systems will produce one or the
> other so when I'm working on a Tibetan corpus I have to normalize them
> to run some analysis. It seems the normalization I perform
> (decomposing U+0F00) should be part of NFD... why isn't it?

The possibly unhelpful answer is that you should use a collation-like
folding for your searches that can then ignore differences that don't
interest you. For example, under the UCA default collation, U+0F00 and
<U+0F68, U+0F7C, U+0F7E> are no more different than upper and lower case
in English. The fact that the UCA default collation gets Tibetan
ordering badly wrong is irrelevant for this application.

Richard.

From indic at unicode.org Mon Mar 19 04:13:39 2018
From: indic at unicode.org (Élie Roux via Indic)
Date: Mon, 19 Mar 2018 10:13:39 +0100
Subject: Composition / Decomposition of Tibetan oM (0F00)
In-Reply-To: <20180318205126.383e1f01@JRWUBU2>
References: <71c75887-ce61-2cf6-a7cc-c750defd2713@telecom-bretagne.eu>
 <20180318205126.383e1f01@JRWUBU2>
Message-ID: <8dba7890-b6a6-92a1-a503-98647c63d841@telecom-bretagne.eu>

Dear Richard,

> Sacred syllable v. run of the mill syllable?

Hmm, ok let's ask more direct questions, which are on two different
aspects of the problem:

1. There are a lot of sacred syllables in Tibetan, why choose this one
in particular? Hung (U+0F67 U+0F71 U+0F74 U+0F82) is at least as sacred
and as widespread...

2. Why isn't U+0F00 considered a composition of U+0F68 U+0F7C U+0F7E in
UnicodeData.txt? What I see is:

0F00;TIBETAN SYLLABLE OM;Lo;0;L;;;;;N;;;;;

while I believe it should contain

0F00;TIBETAN SYLLABLE OM;Lo;0;L;0F68 0F7C 0F7E;;;;N;;;;;

(same for 0F02 and 0F03).

> For example, under the UCA default collation, U+0F00 and <U+0F68,
> U+0F7C, U+0F7E> are no more different than upper and lower case in
> English.

Hmmm thanks a lot for that!
This seems to be somewhat new, but indeed I can see

0F00 ; [.2F19.0020.0004][.2F30.0020.0004][.0000.00C4.0004] # TIBETAN SYLLABLE OM

in http://www.unicode.org/Public/UCA/10.0.0/allkeys.txt

So I guess I'm even more eager to have some clues on my question number
2: if the UCA acknowledges that the composed and decomposed characters
have the same weight, why doesn't UnicodeData list them as
composition/decomposition?

Thank you,
-- 
Elie
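
A minimal Python sketch of the corpus normalization discussed in this
thread, for illustration only. It uses the standard library's
unicodedata module to check whether U+0F00 has a canonical decomposition,
then applies a hand-written folding table before NFD. The names
TIBETAN_SYLLABLE_FOLDS and fold_tibetan_syllables are hypothetical; the
mappings are the equivalences proposed in the first message, not
decompositions defined in UnicodeData.txt.

```python
import unicodedata

# Hypothetical folding table built from the equivalences proposed in the
# thread. These are NOT canonical decompositions from UnicodeData.txt.
TIBETAN_SYLLABLE_FOLDS = {
    0x0F00: "\u0F68\u0F7C\u0F7E",
    0x0F02: "\u0F60\u0F74\u0F82\u0F7F",
    0x0F03: "\u0F60\u0F74\u0F82\u0F14",
}


def fold_tibetan_syllables(text: str) -> str:
    """Expand the three precomposed syllables, then apply standard NFD."""
    expanded = text.translate(TIBETAN_SYLLABLE_FOLDS)
    return unicodedata.normalize("NFD", expanded)


if __name__ == "__main__":
    om = "\u0F00"
    # Empty string means the character database lists no decomposition.
    print(repr(unicodedata.decomposition(om)))
    # Standard NFD alone leaves U+0F00 as it is.
    print(unicodedata.normalize("NFD", om) == om)
    # After the custom folding, both spellings compare equal.
    print(fold_tibetan_syllables(om) == fold_tibetan_syllables("\u0F68\u0F7C\u0F7E"))
```

With this folding applied to both the corpus and the query, text typed
as U+0F00 and text typed as U+0F68 U+0F7C U+0F7E compare equal. The
collation-based comparison suggested in the reply would be an
alternative, but it needs a UCA implementation from outside the Python
standard library.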