Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mark Davis ☕️ via Unicode
Sat Dec 9 09:16:44 CST 2017

1. You make a good point about the GB9c. It should probably instead be
something like:

GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant

Extend is broader than necessary, and there are a few items that have
ccc!=0 but not gcb=extend. But all of those look to be degenerate cases
(the set [\p{ccc!=0}-\p{gcb=extend}], grouped by ccc and
indicsyllabiccategory).
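
To make the difference between the two formulations concrete, here is a minimal sketch in Python. It is not the UAX #29 algorithm: it models only GB9c, works on hand-assigned class labels rather than real character properties, and the function names are hypothetical. It also assumes the revised rule is meant to suppress the break before the LinkingConsonant, since the rule above is only given as "something like".

```python
# Simplified sketch: only the two GB9c variants are modelled, and the class
# labels are assigned by hand (an assumption, not looked up in the UCD).
VIRAMA, ZWJ, EXTEND, LINKING = "Virama", "ZWJ", "Extend", "LinkingConsonant"


def no_break_original(classes, i):
    """Originally proposed GB9c: (Virama | ZWJ) × LinkingConsonant.

    Returns True if the rule forbids a break immediately before index i.
    """
    return i > 0 and classes[i] == LINKING and classes[i - 1] in (VIRAMA, ZWJ)


def no_break_revised(classes, i):
    """Revised GB9c: (Virama | ZWJ) × Extend* LinkingConsonant.

    Assumed reading: any run of Extend between the Virama/ZWJ and the
    LinkingConsonant is skipped when looking backwards from index i.
    """
    if i == 0 or classes[i] != LINKING:
        return False
    j = i - 1
    while j >= 0 and classes[j] == EXTEND:
        j -= 1
    return j >= 0 and classes[j] in (VIRAMA, ZWJ)


# <consonant, nukta, virama, consonant>: canonical order (ccc 7 before ccc 9).
canonical = [LINKING, EXTEND, VIRAMA, LINKING]
# <consonant, virama, nukta, consonant>: canonically equivalent, not normalised.
reordered = [LINKING, VIRAMA, EXTEND, LINKING]

for name, seq in (("canonical", canonical), ("reordered", reordered)):
    print(name,
          "original rule joins:", no_break_original(seq, 3),
          "revised rule joins:", no_break_revised(seq, 3))
```

Under these assumptions the original rule treats the two canonically equivalent sequences differently, while the revised rule joins the final consonant in both.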

Mark

On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode wrote:

> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters.
>
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
> For extended grapheme clusters, the relevant rules are proposed as:
> GB9: ×  (Extend | ZWJ | Virama)
> GB9c: (Virama | ZWJ )   × LinkingConsonant
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
> <consonant, nukta, virama, consonant> (no break)
> but
> <consonant, virama, nukta | consonant>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
> natural order, no break:
> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
> but normalised, there would be a break:
> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules.  How is this expected to work, e.g. for Saurashtra in Tamil
> script?  (There's no Saurashtra data in Version 32 of CLDR.)  Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
> Richard.
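
The reordering described in the quoted message can be checked with Python's standard unicodedata module. The sketch below is illustrative only: the specific code points (Devanagari KA, NUKTA and VIRAMA; Tai Tham HIGH KA, VOWEL SIGN I, SIGN TONE-1 and SAKOT) are chosen here as examples and are not taken from the message, and the helper name ccc_trace is hypothetical.

```python
import unicodedata


def ccc_trace(text):
    """Return (code point, canonical combining class) pairs for text."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Devanagari: KA, VIRAMA (ccc=9), NUKTA (ccc=7), KA, in non-canonical order.
devanagari = "\u0915\u094D\u093C\u0915"
# Tai Tham: HIGH KA, VOWEL SIGN I (ccc=0), TONE-1 (ccc=230), SAKOT (ccc=9),
# HIGH KA, in the natural order described in the quoted message.
tai_tham = "\u1A20\u1A65\u1A75\u1A60\u1A20"

for name, text in (("Devanagari", devanagari), ("Tai Tham", tai_tham)):
    nfd = unicodedata.normalize("NFD", text)
    print(name, "as given:  ", ccc_trace(text))
    print(name, "normalised:", ccc_trace(nfd))
```

Canonical ordering sorts adjacent marks with non-zero combining class into ascending order, so normalisation moves the nukta before the virama in the Devanagari string and the sakot before the tone mark in the Tai Tham one. Those are exactly the pairs of sequences for which the originally proposed GB9c gives different answers.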