Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Richard Wordingham via Unicode unicode at unicode.org
Mon Jan 22 20:34:29 CST 2018


On Sun, 21 Jan 2018 22:34:12 -0800
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:
 
> FYI, I'm thinking now that the change should be:
> 
> GB9c: (Virama | ZWJ )   × LinkingConsonant
> =>  
> GB9c: (Virama ViramaExtend* | ZWJ ) × LinkingConsonant
> 
> where ViramaExtend = [Extend - Virama - \p{ccc=0}]
> (This is pre-partitioning.)
> 
> That is close to your formulation, but for canonical equivalence
> there shouldn't be a need to allow the ViramaExtend after ZWJ, because
> the ZWJ has ccc=0, and thus nothing reorders around it.

These look fine.
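
As a rough illustration of the revised GB9c, here is a minimal Python sketch. It is not a conformant segmenter: the property sets are hand-picked assumptions (two viramas, the main Devanagari and Malayalam consonant ranges, and general categories Mn/Me standing in for Extend), and the name gb9c_joins is hypothetical.

```python
import unicodedata

# Assumption: only these two viramas are modelled.
VIRAMAS = {"\u094D", "\u0D4D"}  # DEVANAGARI SIGN VIRAMA, MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"


def is_linking_consonant(ch):
    # Assumption: KA..HA in Devanagari and Malayalam stand in for LinkingConsonant.
    return "\u0915" <= ch <= "\u0939" or "\u0D15" <= ch <= "\u0D39"


def is_virama_extend(ch):
    # ViramaExtend = [Extend - Virama - \p{ccc=0}], with Extend approximated
    # by general category Mn/Me (the real property differs in places).
    return (unicodedata.category(ch) in ("Mn", "Me")
            and ch not in VIRAMAS
            and unicodedata.combining(ch) != 0)


def gb9c_joins(text, i):
    """Return True if the revised GB9c forbids a break before text[i]."""
    if i <= 0 or i >= len(text) or not is_linking_consonant(text[i]):
        return False
    j = i - 1
    if text[j] == ZWJ:            # ZWJ x LinkingConsonant
        return True
    while j >= 0 and is_virama_extend(text[j]):
        j -= 1                    # skip ViramaExtend* leftwards
    return j >= 0 and text[j] in VIRAMAS  # Virama ViramaExtend* x LinkingConsonant
```

With these assumptions, <KA, VIRAMA, NUKTA, KA> is not broken before the second KA, which the original (Virama | ZWJ) form would not guarantee, since NUKTA (ccc=7) sits between the virama and the consonant in this non-normalized order.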

> Cibu also pointed out on a different thread that for Malayalam we
> need to consider a couple of other forms:
> 
> ... Following contexts should be allowed for requesting reformed or
> traditional conjuncts as per Unicode10.0.0/ch12 page 505.  ...
> 
> /$L ZWNJ $V $L/
> /$L ZWJ $V $L/
> 
> The ZWJ Virama sequence is already provided for by the combination of
> GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would
> mean the addition of something like:
> 
> GB9d: × (ZWNJ ViramaExtend* Virama)

This is OK by me for aksharas.  It might make sense for Tai Tham as
well, where various degrees of binding are attested in what you can
think of as D.DH (as in 'buddha').  If the font formally ligates them
but does not always ligate subscript 'DHA' (i.e. U+1A35 TAI THAM LETTER
LOW THA), <LOW TA, ZWNJ, SAKOT, LOW THA> would provide the unligated
form.  Note that in Tai Tham, SAKOT primarily affects the C2 consonant.
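
A minimal Python sketch of the proposed GB9d follows, under loudly stated assumptions: treating SAKOT as a Virama for this rule is only the possibility raised above, not part of the proposal; Extend is approximated by general categories Mn/Me; and the name gb9d_joins is hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"
# Assumption: MALAYALAM SIGN VIRAMA, plus TAI THAM SIGN SAKOT treated as a
# virama purely for illustration.
VIRAMAS = {"\u0D4D", "\u1A60"}


def is_virama_extend(ch):
    # ViramaExtend = [Extend - Virama - \p{ccc=0}], Extend approximated by Mn/Me.
    return (unicodedata.category(ch) in ("Mn", "Me")
            and ch not in VIRAMAS
            and unicodedata.combining(ch) != 0)


def gb9d_joins(text, i):
    """Return True if the proposed GB9d forbids a break before text[i]."""
    if i <= 0 or i >= len(text) or text[i] != ZWNJ:
        return False
    j = i + 1
    while j < len(text) and is_virama_extend(text[j]):
        j += 1                    # skip ViramaExtend* rightwards
    return j < len(text) and text[j] in VIRAMAS


# <LOW TA, ZWNJ, SAKOT, LOW THA>: no break before the ZWNJ.
print(gb9d_joins("\u1A34\u200C\u1A60\u1A35", 1))
```

The rule only keeps the ZWNJ attached to the preceding consonant; attaching the following consonant would still depend on GB9 and GB9c.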

> 
> Cibu also wrote:
> 
> 
> Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences
> involving legacy chillus. That is, for example, <CHILLU N, VOWEL SIGN
> E> is a valid sequence (Examples in Unicode10.0.0/ch12 Table 12.36). Its
> legacy equivalent would be <NA, VIRAMA, ZWJ, VOWEL SIGN E>. It might be OK to
> disallow this; but, we should be mindful of this side effect.

I see no problem here.  By GB9, we get 

NA × VIRAMA × ZWJ SIGN_E

By GB9a, we then get

 NA × VIRAMA × ZWJ × SIGN_E
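
The two steps above can be checked mechanically with a small Python sketch that applies only GB9 and GB9a. It approximates Extend by general categories Mn/Me and SpacingMark by Mc, which is an assumption (the real Grapheme_Cluster_Break values differ for some characters), and the function name is hypothetical.

```python
import unicodedata

ZWJ = "\u200D"


def clusters_gb9_gb9a(text):
    """Split text using only GB9 (x (Extend | ZWJ)) and GB9a (x SpacingMark)."""
    clusters = []
    for ch in text:
        cat = unicodedata.category(ch)
        gb9 = ch == ZWJ or cat in ("Mn", "Me")   # approximated Extend, or ZWJ
        gb9a = cat == "Mc"                        # approximated SpacingMark
        if clusters and (gb9 or gb9a):
            clusters[-1] += ch                    # no break before ch
        else:
            clusters.append(ch)
    return clusters


# <NA, VIRAMA, ZWJ, VOWEL SIGN E> in Malayalam
legacy_chillu = "\u0D28\u0D4D\u200D\u0D46"
print(len(clusters_gb9_gb9a(legacy_chillu)))  # 1
```

Under these approximations the legacy chillu sequence stays in one cluster without any help from GB9c.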

Have I missed something?

Do you want me to try to formally submit my comments from this post?  I
will be going to bed as soon as I've finished extracting comments from
this thread.

Richard.


