Questions about Indic Conjunct Clusters

James Kass jameskass at code2001.com
Wed May 15 16:28:41 CDT 2024



On 2024-04-17 6:46 PM, Don Hosek via Unicode wrote:
> It’s not immediately clear from the specification what the correct 
> implementation would be for a few pathological cases of the Indic 
> Conjunct Cluster rule in the Unicode 15.1.0 specification.
>
> For convenience’s sake, let’s use the following shorthand:
>
> C = \p{InCB=Consonant}
> E = \p{InCB=Extend}
> L = \p{InCB=Linker}
> M = \p{M}
>
>  1. It appears that both E and L are subsets of M, and I think
>     E∪L = M. Is this correct? If so, is GB9c equivalent to saying that
>     CM+C should be considered a single cluster iff the sequence M+
>     contains at least one character from L? (Having written this
>     question and looked at the statement of the rule at
>     https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html,
>     my restatement seems to correspond to 9.3 in that list.)
>  2. Should a sequence like, e.g., CLCLC be considered a single cluster
>     or would it be two clusters, CLCL ÷ C?
>
>
> I would note also that the chart at 
> https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html seems 
> to be not quite correct.
>
> -dh

One of the binary properties of U+094D DEVANAGARI SIGN VIRAMA (halant) 
is "Grapheme Link".

So, IIUC, "CLCLC" is like Consonant + Virama + Consonant + Virama + 
Consonant, and it should be considered a single grapheme cluster.
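
To make that reading concrete, here is a minimal sketch in Python that works 
only on the C/E/L shorthand from the quoted message, not on real text. It 
assumes that E and L never start a new cluster (both have 
Grapheme_Cluster_Break=Extend, so GB9 applies) and that a break before C is 
suppressed only by GB9c, i.e. C [E L]* L [E L]* × C. The function name 
split_clusters is made up for illustration, and every other UAX #29 rule is 
ignored.

```python
import re

# Shorthand from the quoted message:
#   C = InCB=Consonant, E = InCB=Extend, L = InCB=Linker
# GB9c as read here: C [E L]* L [E L]* x C
# The pattern checks whether the text *before* a consonant ends with
# a consonant followed by marks that include at least one linker.
_GB9C_PREFIX = re.compile(r"C[EL]*L[EL]*$")


def split_clusters(shorthand: str) -> list[str]:
    """Split a string over the alphabet {C, E, L} into clusters.

    Assumption: only GB9 (no break before E or L), GB9c, and the
    default "break everywhere else" rule are modelled.
    """
    clusters = []
    start = 0
    for i in range(1, len(shorthand)):
        if shorthand[i] != "C":
            continue  # GB9: E and L attach to whatever precedes them
        if _GB9C_PREFIX.search(shorthand[:i]):
            continue  # GB9c: a linker sits between the two consonants
        clusters.append(shorthand[start:i])
        start = i
    if shorthand:
        clusters.append(shorthand[start:])
    return clusters


if __name__ == "__main__":
    for sample in ("CLCLC", "CEC", "CELEC", "CLCEC"):
        print(sample, "->", " ÷ ".join(split_clusters(sample)))
```

Under these assumptions "CLCLC" comes out as one cluster, because GB9c is 
evaluated at each consonant boundary and each boundary has a linker before it, 
while "CEC" splits into CE ÷ C.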

Although I know a little bit about Indic conjuncts, I don't have a 
working understanding of the syntax of the page linked above.  So I'm 
"bumping" this post in the hope that someone more knowledgeable will 
respond to the questions.

Meanwhile, here's a link to a Microsoft typography spec page which 
illustrates how the shaping engine determines cluster boundaries (of 
course using OpenType terminology):
https://learn.microsoft.com/en-us/typography/script-development/devanagari

One of the examples on that page is (Ra + halant + Da + halant + Ma + 
I-matra), which is treated as a cluster:  र्द्मि
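
For anyone who wants to see what is inside that example, the following Python 
sketch lists its code points using only the standard unicodedata module. The 
helper name describe is made up for illustration. Note that unicodedata does 
not expose the InCB property, so this shows only names and general categories.

```python
import unicodedata

# Ra + halant + Da + halant + Ma + I-matra, spelled out as escapes
CLUSTER = "\u0930\u094D\u0926\u094D\u092E\u093F"


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, general category, character name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for code_point, category, name in describe(CLUSTER):
        print(code_point, category, name)
```

The halant (U+094D) is reported as a nonspacing mark (Mn), while the I-matra 
(U+093F) is a spacing mark (Mc). As far as I can tell, a spacing mark like 
this is kept in the cluster by the separate SpacingMark rule (GB9a) rather 
than by GB9c.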

Hope this is helpful.


