Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Richard Wordingham via Unicode unicode at unicode.org
Fri Dec 8 16:06:19 CST 2017


Apart from the likely but unmandated consequence of making editing
Indic text more difficult (possibly contrary to the UK's Equality Act
2010), there is another difficulty that will follow directly from the
currently proposed expansion of grapheme clusters
(https://www.unicode.org/reports/tr29/proposed.html).

As far as I can tell, text boundaries have hitherto been cunningly
crafted so that they are not changed by normalisation.  Have I missed
something, or has there been a change in policy?

For extended grapheme clusters, the relevant rules are proposed as:

GB9: × (Extend | ZWJ | Virama)

GB9c: (Virama | ZWJ) × LinkingConsonant

Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
This would lead canonically equivalent text to have strikingly
different divisions:

<consonant, nukta, virama, consonant> (canonical order, no break)

but

<consonant, virama, nukta | consonant> (not in canonical order, break at '|')
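The discrepancy can be reproduced with a minimal sketch.  It assumes
Devanagari KA, NUKTA and VIRAMA as the sample characters, and it uses a
hypothetical three-entry classification (gcb_class) in place of the
real property data; only the two proposed rules quoted above are
applied, with a break everywhere else.  It is an illustration of the
argument, not an implementation of UAX#29.

```python
import unicodedata

KA, NUKTA, VIRAMA = "\u0915", "\u093C", "\u094D"

def gcb_class(ch):
    # Hypothetical, hand-written classification covering only the
    # three sample characters; real data would come from the UCD.
    if ch == VIRAMA:
        return "Virama"
    if ch == NUKTA:
        return "Extend"
    if ch == KA:
        return "LinkingConsonant"
    return "Other"

def clusters(text):
    """Split text using only proposed GB9 and GB9c; break otherwise."""
    if not text:
        return []
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        p, c = gcb_class(prev), gcb_class(cur)
        gb9 = c in {"Extend", "ZWJ", "Virama"}
        gb9c = p in {"Virama", "ZWJ"} and c == "LinkingConsonant"
        if gb9 or gb9c:
            out[-1] += cur      # no break: extend the current cluster
        else:
            out.append(cur)     # break: start a new cluster
    return out

unordered = KA + VIRAMA + NUKTA + KA
# Canonical ordering moves nukta (ccc=7) in front of virama (ccc=9).
normalised = unicodedata.normalize("NFD", unordered)

print(len(clusters(unordered)), len(clusters(normalised)))
```

Under these assumptions the unordered string splits into two clusters
and its canonically equivalent normalised form into one.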

There are other variations on this theme. In Tai Tham, we have the
following conflict:

natural order, no break:

<consonant, non-spacing-vowel, tone-mark, sakot, consonant>

but normalised, there would be a break:

<consonant, non-spacing-vowel, sakot, tone-mark | consonant>
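The Tai Tham reordering can be checked directly against the character
database.  The sketch below assumes U+1A20 HIGH KA, U+1A65 VOWEL SIGN I,
U+1A75 SIGN TONE-1 and U+1A60 SIGN SAKOT as stand-ins for the classes
named above; the resulting string is a made-up sequence chosen for its
structure, not a real word.

```python
import unicodedata

HIGH_KA = "\u1A20"   # consonant
VOWEL_I = "\u1A65"   # non-spacing vowel, ccc=0
TONE_1 = "\u1A75"    # tone mark, ccc=230
SAKOT = "\u1A60"     # sakot, ccc=9

natural = HIGH_KA + VOWEL_I + TONE_1 + SAKOT + HIGH_KA
# Canonical ordering sorts the adjacent marks by combining class,
# so sakot (9) is moved in front of the tone mark (230).
normalised = unicodedata.normalize("NFD", natural)

def sakot_precedes_consonant(text):
    """True if sakot is immediately followed by the consonant."""
    return SAKOT + HIGH_KA in text

print(sakot_precedes_consonant(natural))
print(sakot_precedes_consonant(normalised))
```

In the natural order sakot stands immediately before the second
consonant, which is what GB9c needs; after normalisation the tone mark
intervenes, so GB9c no longer applies at that position.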

From reading the text, it seems that it is expected that the presence
or absence of a break should be fine-tuned by CLDR language-specific
rules.  How is this expected to work, e.g. for Saurashtra in Tamil
script?  (There's no Saurashtra data in Version 32 of CLDR.)  Would the
root locale now specify the default segmentation rule, rather than
UAX#29 plus the Unicode Character Database?

Richard.


