Questions about Indic Conjunct Clusters

Wed Apr 17 13:46:47 CDT 2024

It’s not immediately clear from the specification what the correct
implementation would be for a few pathological cases of the Indic conjunct
cluster rule (GB9c) in Unicode 15.1.0.

For convenience’s sake, let’s use the following shorthand:

C = \p{InCB=Consonant}
E = \p{InCB=Extend}
L = \p{InCB=Linker}
M = \p{M}

   1. It appears that both E and L are subsets of M, and I think E ∪ L = M.
   Is this correct? If so, is GB9c equivalent to saying that C M+ C should
   be considered a single cluster iff the sequence M+ contains at least one
   character from L? (Having written this question and then looked at the
   statement of the rule at
   https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html,
   my restatement seems to correspond to rule 9.3 in that list.)
   2. Should a sequence such as CLCLC be considered a single cluster, or
   would it be two clusters, CLCL ÷ C?
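
To make the two questions concrete, here is a minimal Python sketch of one
possible reading of GB9c, written over the shorthand letters rather than real
code points. It assumes the rule has the form C [E L]* L [E L]* × C as worded
in UAX #29 for 15.1, and that the rule is tested independently at every
boundary, so the left context may start at any earlier consonant. The function
name and the use of "X" for any character outside C, E and L are hypothetical,
and no other segmentation rule (GB9, for instance) is modeled.

```python
import re
from typing import List

# Shorthand from the message: C = InCB=Consonant, E = InCB=Extend,
# L = InCB=Linker. Any other letter (e.g. "X") stands for a character
# that is in none of these three classes (an assumption of this sketch).

# Left-hand context of GB9c, assumed to be: C [E L]* L [E L]*
# The "$" anchors the context so that it ends exactly at the boundary.
GB9C_LEFT_CONTEXT = re.compile(r"C[EL]*L[EL]*$")


def gb9c_no_break_positions(classes: str) -> List[int]:
    """Return every index i where GB9c alone forbids a break before classes[i]."""
    return [
        i
        for i in range(1, len(classes))
        if classes[i] == "C" and GB9C_LEFT_CONTEXT.search(classes[:i])
    ]


if __name__ == "__main__":
    for sample in ("CLCLC", "CEC", "CELEC", "CLXC"):
        print(sample, gb9c_no_break_positions(sample))
```

Under this reading, "CLCLC" yields [2, 4], meaning GB9c forbids a break before
both the second and the third C, while "CEC" yields no positions because the
marks between the consonants contain no linker. Whether this per-boundary
reading is the one the specification intends is exactly what is being asked.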

I would also note that the chart at
https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.html
seems not to be quite correct.

-dh