Questions about Indic Conjunct Clusters
James Kass
jameskass at code2001.com
Wed May 15 16:28:41 CDT 2024
On 2024-04-17 6:46 PM, Don Hosek via Unicode wrote:
> It’s not immediately clear from the specification what the correct
> implementation would be for a few pathological cases of the Indic
> Conjunct Cluster specification in Unicode 15.1.0.
>
> For convenience’s sake, let’s use the following shorthand:
>
> C = \p{InCB=Consonant}
> E = \p{InCB=Extend}
> L = \p{InCB=Linker}
> M = \p{M}
>
> 1. It appears that both E and L are subsets of M, and I think E∪L = M.
> Is this correct? If so, is GB9c equivalent to saying that CM+C
> should be considered a single cluster iff that sequence of
> characters M+ contains at least one character from L? (Having
> written this question and looking at the statement of the rule
> from
> https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html,
> my restatement seems to correspond to 9.3 in that list).
> 2. Should a sequence like, e.g., CLCLC be considered a single cluster
> or would it be two clusters, CLCL ÷ C?
>
>
> I would note also that the chart at
> https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html seems
> to be not quite correct.
>
> -dh
One of the binary properties of U+094D DEVANAGARI SIGN VIRAMA (halant)
is "Grapheme Link".
So, IIUC, "CLCLC" is like Consonant + Virama + Consonant + Virama +
Consonant, and it should be considered a single grapheme cluster.
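To make that reading concrete, here is a minimal sketch that works on the shorthand letters from the quoted message (C, E, L) rather than on real text. It assumes GB9c has the form C [E L]* L [E L]* × C as published for Unicode 15.1, and it models only that rule plus the fact that E and L never start a cluster. The function name split_clusters is made up for illustration; this is not a full grapheme segmenter.

```python
import re

# Left context required by GB9c: a Consonant, then any mix of Extend/Linker
# that contains at least one Linker, ending right at the candidate boundary.
_GB9C_LEFT = re.compile(r"C[EL]*L[EL]*$")


def split_clusters(classes: str) -> list[str]:
    """Split a string of shorthand classes (C, E, L) into clusters.

    Hypothetical helper: only two rules are modelled.
      * E and L attach to whatever precedes them (no break before them).
      * GB9c: no break before C when the left context matches _GB9C_LEFT.
    Every other boundary is treated as a break.
    """
    clusters = []
    start = 0
    for i in range(1, len(classes)):
        ch = classes[i]
        if ch in "EL":
            continue  # never starts a cluster in this model
        if ch == "C" and _GB9C_LEFT.search(classes[:i]):
            continue  # GB9c suppresses the break
        clusters.append(classes[start:i])
        start = i
    if classes:
        clusters.append(classes[start:])
    return clusters


print(split_clusters("CLCLC"))  # one cluster under this model
print(split_clusters("CEC"))    # no Linker between the consonants, so it splits
```

Under these assumptions the rule is checked at each boundary independently, so the second L in CLCLC joins the third C just as the first L joins the second.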
Although I know a little bit about Indic conjuncts, I don't have a
working understanding of the syntax of the page linked above. So I'm
"bumping" this post in the hope that someone more knowledgeable will
respond to the questions.
Meanwhile, here's a link to a Microsoft typography spec page which
illustrates how the shaping engine determines cluster boundaries (of
course using OpenType terminology):
https://learn.microsoft.com/en-us/typography/script-development/devanagari
One of the examples on that page is (Ra + halant + Da + halant + Ma +
I-matra), which is treated as a cluster: र्द्मि
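For anyone who wants to see how that example lines up with the CLCLC pattern, the following short sketch lists its code points using Python's standard unicodedata module. The string is written with escapes, on the assumption that the example is U+0930 U+094D U+0926 U+094D U+092E U+093F; the helper name describe is only illustrative.

```python
import unicodedata

# Ra + halant + Da + halant + Ma + I-matra, written as escapes
# (assumed code point sequence for the example above).
cluster = "\u0930\u094D\u0926\u094D\u092E\u093F"


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(cluster):
    print(code_point, name)
```

The first five characters follow the Consonant + Virama + Consonant + Virama + Consonant shape, and the trailing vowel sign is a dependent mark that follows the last consonant.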
Hope this is helpful.