Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues
Mark Davis ☕️ via Unicode
unicode at unicode.org
Sat Dec 9 09:16:44 CST 2017
1. You make a good point about GB9c. It should probably instead be
GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant
Extend is broader than necessary, and there are a few items that have
ccc!=0 but not gcb=Extend. But all of those look to be degenerate cases.
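
A minimal Python sketch of the difference between the two forms of GB9c,
applied to the sequences in the quoted message below. It rests on rough
stand-ins for the property classes, not the proposal's actual sets: Virama
is approximated as ccc=9, Extend as General_Category Mn or Me, and
LinkingConsonant as General_Category Lo. The function names are
hypothetical, and the revised rule is read here as suppressing the break
before the LinkingConsonant.

```python
import unicodedata

ZWJ = "\u200D"


def is_virama(ch):
    # Assumption: treat any character with ccc=9 as Virama.
    return unicodedata.combining(ch) == 9


def is_extend(ch):
    # Assumption: approximate gcb=Extend by nonspacing/enclosing marks.
    return unicodedata.category(ch) in ("Mn", "Me")


def is_linking_consonant(ch):
    # Assumption: approximate LinkingConsonant by General_Category Lo.
    return unicodedata.category(ch) == "Lo"


def joins_original(text, i):
    """No break before text[i] under (Virama | ZWJ) x LinkingConsonant."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    prev = text[i - 1]
    return is_virama(prev) or prev == ZWJ


def joins_revised(text, i):
    """No break before text[i] under (Virama | ZWJ) x Extend* LinkingConsonant."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    j = i - 1
    while j >= 0:
        if is_virama(text[j]) or text[j] == ZWJ:
            return True
        if not is_extend(text[j]):
            return False
        j -= 1  # skip an intervening Extend character
    return False


# Devanagari KA, NUKTA, VIRAMA, KA and its non-canonical reordering.
nukta_first = "\u0915\u093C\u094D\u0915"
virama_first = "\u0915\u094D\u093C\u0915"
```

Under these assumptions the original rule joins the final consonant only
in nukta_first, while the revised rule joins it in both orders.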
On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:
> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters.
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
> For extended grapheme clusters, the relevant rules are proposed as:
> GB9: × (Extend | ZWJ | Virama)
> GB9c: (Virama | ZWJ ) × LinkingConsonant
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
> <consonant, nukta, virama, consonant> (no break)
> <consonant, virama, nukta | consonant>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
> natural order, no break:
> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
> but normalised, there would be a break:
> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules. How is this expected to work, e.g. for Saurashtra in Tamil
> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
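
The canonical reordering described in the quoted message can be checked
with Python's standard unicodedata module. This is only a sketch: the
sample code points (Devanagari KA, Tai Tham HIGH KA, VOWEL SIGN I, TONE-1)
are illustrative choices, not taken from the message, and the helper names
are hypothetical.

```python
import unicodedata


def canonical_order(text):
    # NFD applies canonical reordering: runs of combining marks are
    # stably sorted by canonical combining class (ccc).
    return unicodedata.normalize("NFD", text)


def ccc_values(text):
    # Combining class of each code point, 0 for starters.
    return [unicodedata.combining(ch) for ch in text]


# Devanagari: consonant, virama (ccc=9), nukta (ccc=7), consonant.
devanagari = "\u0915\u094D\u093C\u0915"

# Tai Tham: consonant, non-spacing vowel, tone mark (ccc=230),
# sakot (ccc=9), consonant.
tai_tham = "\u1A20\u1A65\u1A75\u1A60\u1A20"

for sample in (devanagari, tai_tham):
    normalised = canonical_order(sample)
    print(ccc_values(sample), "->", ccc_values(normalised))
```

In both samples normalisation moves the ccc=9 mark relative to its
neighbour, so a rule that requires the virama or sakot to sit immediately
before the consonant sees different input for canonically equivalent text.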