Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mon Dec 11 04:16:31 CST 2017

On Sun, 10 Dec 2017 21:14:18 -0800
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:

> > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant  
> 
> You can also explicitly request ligatureification with a ZWJ, so
> perhaps this rule should be something like
> 
> (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant
> 
> -Manish
> 
> On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
> unicode at unicode.org> wrote:  
> 
> > 1. You make a good point about the GB9c. It should probably instead
> > be something like:
> >
> > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant

This change is unnecessary.  Start from Draft 1, which has:

GB9: × (Extend | ZWJ | Virama)
GB9c: (Virama | ZWJ) × LinkingConsonant

If the classes used in the rules are to be disjoint, we then have to
split Extend into something like ViramaExtend and OtherExtend to allow
normalised (NFC/NFD) text, at which point we may as well continue to
have rules that work without any normalisation. Informally,

ViramaExtend = Extend and ccc ≠ 0.

OtherExtend = Extend and ccc = 0.

(We might need to put additional characters in ViramaExtend.)

This gives us rules:

GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama)

GB9c': (Virama | ZWJ) ViramaExtend* × LinkingConsonant

So, for a sequence <virama, ZWJ, nukta, LinkingConsonant>, GB9' gives us

virama × ZWJ × nukta LinkingConsonant

and GB9c' gives us

virama × ZWJ × nukta × LinkingConsonant
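
To make the worked example above checkable, here is a minimal Python sketch of GB9' and GB9c' applied to a sequence of class labels.  It is only an illustration: it works on class names rather than real characters, the function name no_break_gaps is made up for this note, and putting nukta in ViramaExtend is an assumption that follows from the informal definition (ccc ≠ 0).  A False result only means that these two rules do not forbid a break; other rules would still have to be consulted.

```python
# Classes that GB9' attaches to whatever precedes them.
GB9_PRIME = {"OtherExtend", "ViramaExtend", "ZWJ", "Virama"}


def no_break_gaps(classes, use_gb9c=True):
    """For each gap between classes[i] and classes[i + 1], return True
    if GB9' (or, when use_gb9c is set, GB9c') forbids a break there."""
    result = []
    for i in range(len(classes) - 1):
        nxt = classes[i + 1]
        # GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama)
        no_break = nxt in GB9_PRIME
        # GB9c': (Virama | ZWJ) ViramaExtend* × LinkingConsonant
        if use_gb9c and nxt == "LinkingConsonant":
            j = i
            while j >= 0 and classes[j] == "ViramaExtend":
                j -= 1  # skip back over the ViramaExtend* run
            if j >= 0 and classes[j] in ("Virama", "ZWJ"):
                no_break = True
        result.append(no_break)
    return result


if __name__ == "__main__":
    # <virama, ZWJ, nukta, LinkingConsonant>, with nukta assumed ViramaExtend
    seq = ["Virama", "ZWJ", "ViramaExtend", "LinkingConsonant"]
    print(no_break_gaps(seq, use_gb9c=False))  # GB9' only
    print(no_break_gaps(seq))                  # GB9' and GB9c'
```

With GB9' alone the last gap is left undecided; adding GB9c' closes it, which is the point of carrying ViramaExtend* inside the rule.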

---
In Rule GB9c, what examples justify including ZWJ?  Are they just the C1
half-forms?  My knowledge suggests that

GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant

might be more appropriate.
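
For comparison, a small Python sketch of GB9c'' on its own, again over class labels rather than real characters (the function name gb9c_double_prime is hypothetical).  The practical difference from GB9c' is that a ZWJ with no virama before it no longer joins a following LinkingConsonant.

```python
def gb9c_double_prime(classes, i):
    """True if GB9c'' forbids a break between classes[i] and classes[i + 1].

    GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant
    """
    if classes[i + 1] != "LinkingConsonant":
        return False
    j = i
    while j >= 0 and classes[j] in ("ZWJ", "ViramaExtend"):
        j -= 1  # skip back over the (ZWJ | ViramaExtend)* run
    return j >= 0 and classes[j] == "Virama"


if __name__ == "__main__":
    # A virama is present, so the consonant is joined.
    print(gb9c_double_prime(
        ["Virama", "ZWJ", "ViramaExtend", "LinkingConsonant"], 2))
    # ZWJ without a virama: GB9c'' does not apply.
    print(gb9c_double_prime(
        ["LinkingConsonant", "ZWJ", "LinkingConsonant"], 1))
```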

Richard.