Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mark Davis ☕️ via Unicode unicode at unicode.org
Sat Dec 9 09:16:44 CST 2017


1. You make a good point about GB9c. It should probably instead be
something like:

GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant


Extend is broader than necessary, and there are a few characters that have
ccc!=0 but not gcb=Extend. But all of those look to be degenerate cases.

https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory
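
As a rough illustration of the normalisation issue raised in the quoted
message below, here is a minimal Python sketch. It is not the UAX #29
algorithm. The three classifier functions are stand-ins I made up for the
example: Virama is approximated by ccc=9, Extend by general category Mn/Me,
and LinkingConsonant by general category Lo. The real properties are defined
differently. joins_revised models one reading of the intent of the revised
rule, namely that a linking consonant stays attached when it is preceded by
Virama (or ZWJ) plus any number of Extend characters.

```python
import unicodedata

ZWJ = "\u200d"

def is_virama(ch):
    # Assumption: approximate the proposed Virama class by ccc=9.
    return unicodedata.combining(ch) == 9

def is_extend(ch):
    # Assumption: approximate gcb=Extend by general category Mn or Me.
    return unicodedata.category(ch) in ("Mn", "Me")

def is_linking_consonant(ch):
    # Assumption: treat any Lo letter as a linking consonant.
    return unicodedata.category(ch) == "Lo"

def joins_original(text, i):
    """Draft GB9c: no break before text[i] only if a virama or ZWJ
    immediately precedes the consonant."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    prev = text[i - 1]
    return is_virama(prev) or prev == ZWJ

def joins_revised(text, i):
    """Revised idea: allow Extend characters between the virama (or ZWJ)
    and the consonant."""
    if i == 0 or not is_linking_consonant(text[i]):
        return False
    j = i - 1
    while j >= 0:
        if is_virama(text[j]) or text[j] == ZWJ:
            return True
        if not is_extend(text[j]):
            return False
        j -= 1
    return False

# Devanagari KA, NUKTA (ccc=7), VIRAMA (ccc=9), SSA in both orders.
normalised = "\u0915\u093c\u094d\u0937"    # <consonant, nukta, virama, consonant>
unnormalised = "\u0915\u094d\u093c\u0937"  # <consonant, virama, nukta, consonant>

# Canonical reordering sorts the two marks by combining class.
print(unicodedata.normalize("NFD", unnormalised) == normalised)

# The draft rule treats the two canonically equivalent strings differently.
print(joins_original(normalised, 3), joins_original(unnormalised, 3))

# With Extend* allowed, both orders keep the final consonant attached.
print(joins_revised(normalised, 3), joins_revised(unnormalised, 3))
```

Under these assumptions the draft rule joins the final consonant in only
one of the two orders, while the revised rule joins it in both.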



Mark <https://twitter.com/mark_e_davis>

On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters
> (https://www.unicode.org/reports/tr29/proposed.html).
>
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
>
> For extended grapheme clusters, the relevant rules are proposed as:
>
> GB9: ×  (Extend | ZWJ | Virama)
>
> GB9c: (Virama | ZWJ )   × LinkingConsonant
>
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
>
> <consonant, nukta, virama, consonant> (no break)
>
> but
>
> <consonant, virama, nukta | consonant>
>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
>
> natural order, no break:
>
> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
>
> but normalised, there would be a break:
>
> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules.  How is this expected to work, e.g. for Saurashtra in Tamil
> script?  (There's no Saurashtra data in Version 32 of CLDR.)  Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
>
> Richard.
>
>