Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Manish Goregaokar via Unicode unicode at unicode.org
Sun Dec 10 23:14:18 CST 2017


> GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant

You can also explicitly request ligature formation with a ZWJ, so perhaps
this rule should be something like:

(Virama ZWJ? | ZWJ) × Extend* LinkingConsonant
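
To make the shape of that rule concrete, here is a rough sketch in Python.
It is only a toy model: each code point is reduced to a one-letter class
label (V = Virama, Z = ZWJ, E = Extend, L = LinkingConsonant), and the
function name gb9c_forbids_break is made up for illustration. A real
implementation would take the classes from the Unicode Character Database
and would apply the other GB rules as well.

```python
import re

# Hypothetical one-letter class labels (an assumption of this sketch):
#   V = Virama, Z = ZWJ, E = Extend, L = LinkingConsonant
LEFT = re.compile(r"(?:VZ?|Z)$")   # left context:  (Virama ZWJ? | ZWJ)
RIGHT = re.compile(r"E*L")         # right context: Extend* LinkingConsonant


def gb9c_forbids_break(classes: str, pos: int) -> bool:
    """Return True if the rule forbids a break between classes[pos-1] and classes[pos]."""
    left_ok = LEFT.search(classes[:pos]) is not None
    right_ok = RIGHT.match(classes[pos:]) is not None
    return left_ok and right_ok


# <consonant, virama, ZWJ, consonant>: no break before the second consonant.
print(gb9c_forbids_break("LVZL", 3))
# <consonant, virama, extend, consonant>: no break after the virama.
print(gb9c_forbids_break("LVEL", 2))
# <consonant, extend, consonant>: this rule does not apply.
print(gb9c_forbids_break("LEL", 2))
```

The boundary between Virama and ZWJ is not covered by this rule; in the
proposal it falls under GB9.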

-Manish

On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
unicode at unicode.org> wrote:

> 1. You make a good point about the GB9c. It should probably instead be
> something like:
>
> GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
>
>
> Extend is broader than necessary, and there are a few items that have
> ccc!=0 but not gcb=extend. But all of those look to be degenerate cases.
>
> <https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%5Cp%7Bccc!=0%7D-%5Cp%7Bgcb=extend%7D]&g=ccc+indicsyllabiccategory>
>
>
>
> Mark <https://twitter.com/mark_e_davis>
>
> On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
>> Apart from the likely but unmandated consequence of making editing
>> Indic text more difficult (possibly contrary to the UK's Equality Act
>> 2010), there is another difficulty that will follow directly from the
>> currently proposed expansion of grapheme clusters
>> (https://www.unicode.org/reports/tr29/proposed.html).
>>
>> Unless I am missing something, text boundaries have hitherto been
>> cunningly crafted so that they are not changed by normalisation.
>> Have I missed something, or has there been a change in policy?
>>
>> For extended grapheme clusters, the relevant rules are proposed as:
>>
>> GB9: ×  (Extend | ZWJ | Virama)
>>
>> GB9c: (Virama | ZWJ )   × LinkingConsonant
>>
>> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
>> This would lead canonically equivalent text to have strikingly
>> different divisions:
>>
>> <consonant, nukta, virama, consonant> (no break)
>>
>> but
>>
>> <consonant, virama, nukta | consonant>
>>
>> There are other variations on this theme. In Tai Tham, we have the
>> following conflict:
>>
>> natural order, no break:
>>
>> <consonant, non-spacing-vowel, tone-mark, sakot, consonant>
>>
>> but normalised, there would be a break:
>>
>> <consonant, non-spacing-vowel, sakot, tone-mark | consonant>
>>
>> From reading the text, it seems that it is expected that the presence
>> or absence of a break should be fine-tuned by CLDR language-specific
>> rules.  How is this expected to work, e.g. for Saurashtra in Tamil
>> script?  (There's no Saurashtra data in Version 32 of CLDR.)  Would the
>> root locale now specify the default segmentation rule, rather than
>> UAX#29 plus the Unicode Character Database?
>>
>> Richard.
>>
>>
>
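
Richard's nukta/virama example quoted above can be reproduced with Python's
standard unicodedata module. This is a small sketch using Devanagari; the
choice of KA and SSA as the two consonants is arbitrary and only for
illustration. It shows that canonical ordering moves the nukta (ccc=7) in
front of the virama (ccc=9), so the two spellings are canonically
equivalent.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
NUKTA = "\u093C"   # DEVANAGARI SIGN NUKTA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
SSA = "\u0937"     # DEVANAGARI LETTER SSA (arbitrary second consonant)

# Canonical combining classes of the two marks.
print(unicodedata.combining(NUKTA), unicodedata.combining(VIRAMA))

# <consonant, virama, nukta, consonant> as it might be typed.
typed = KA + VIRAMA + NUKTA + SSA

# NFD applies canonical reordering: marks are sorted by combining class.
normalized = unicodedata.normalize("NFD", typed)

for ch in normalized:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

After normalization the virama is immediately followed by the consonant,
whereas in the typed order the nukta sits between them.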