Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mon Dec 11 10:07:05 CST 2017

On Mon, 11 Dec 2017 08:59:20 +0100
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> The proposed rules do not distinguish the different visual forms that
> a sequence of characters surrounding a virama can have, such as
> 
>    1. an explicit virama, or
>    2. a half-form is visible, or
>    3. a ligature is created.
> 
> That is following the requested structure in
> http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf.
> 
> So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct
> Forms in Devanagari
> <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G14632>)
> doesn't break a GC, nor do instances where a particular script always
> shows an explicit virama between two particular consonants. All the
> lines on Figure 12-7. Consonant Forms in Devanagari and Oriya
> <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G59257>
> having a virama would have single GCs (that is, all but the first
> line). [That, after correcting the rules as per Manish Goregaokar's
> feedback, thanks!]
> 
> The examples in "Annexure B" of 17200-text-seg-rec.pdf
> <http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf> clearly
> include #2 and #3, but don't have any examples of #1 (as far as I can
> tell from a quick scan). It would be very useful to have explicit
> examples that included #1, and included scripts other than Devanagari
> (+swaran, others). While
> the online tool at http://unicode.org/cldr/utility/breaks.jsp can't
> yet be used until the Unicode 11 UCD is further along, I have an
> implementation of the new rules such that I can take any particular
> list of words and generate the breaks. So if someone can supply
> examples from different scripts or with different combinations of
> virama, zwj, zwnj, etc., I can push out the result to this list.

Tai Tham oddities, which could cause issues with advanced typography:

ᩋᩣᩴᨶᩣ᩠ᨲ (Currently C-VN-C-VH-C, becoming C-VN-C-VHC)
ᨾᨶ᩠ᨲᩣ (Currently C-CH-C-V, becoming C-CHC-V)
More obvious versions of the above, with consonants other than
U+1A36 TAI THAM LETTER NA:

ᨠᩣ᩠ᩁ (Currently C-VH-C, becoming C-VHC)
ᩉᩖ᩠ᩅᩣ (Currently CMS-C-V, becoming CMSC-V)

A clear case for tailoring is Pali ᩈᩘᨥᩮᩣ (CM-CVV, but in Laos
and in much Northern Thai usage, U+1A58 TAI THAM SIGN MAI KANG LAI
merits gcb=prepend. Northeastern Thailand has the same style as Laos,
so pi_TH would be far too vague as a locale.)  Compare with Myanmar
script သင်္ဃော (currently C-CHH-CVV, becoming C-CHHCVV), with a pure
killer followed by an invisible stacker.

ᩈᩮᩥᩁ᩠᩺ᨷ (currently CVV-CMH-C, becoming CVV-CHHC) will be a case of
adjacent pure killer and invisible stacker that commute (to use the
terminology of traces).  The more typical commutation problem from Tai
Tham is exemplified by ᨩᩥ᩠᩶ᨶ (currently CVTH-C, becoming CVTHC), where
the tone mark and invisible stacker commute.
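
For anyone wanting to see the commutation for themselves, here is a
minimal Python sketch. It relies only on the standard unicodedata
module; the choice of U+1A75 TAI THAM SIGN TONE-1 as the tone mark and
the helper name canonical() are mine, purely for illustration. Because
the invisible stacker has ccc=9 and the tone mark ccc=230, canonical
ordering puts the stacker first whichever way round they are typed.

```python
import unicodedata

SAKOT = "\u1A60"   # TAI THAM SIGN SAKOT, the invisible stacker
TONE_1 = "\u1A75"  # TAI THAM SIGN TONE-1, chosen here only as an example

def canonical(s):
    """Return the canonically ordered (NFD) form of s."""
    return unicodedata.normalize("NFD", s)

typed_tone_first = TONE_1 + SAKOT
typed_sakot_first = SAKOT + TONE_1

# Show the canonical combining classes that drive the reordering.
for ch in (SAKOT, TONE_1):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")

# Both typing orders are canonically equivalent, so a segmentation rule
# has to cope with marks standing between the stacker and the consonant.
print(canonical(typed_tone_first) == canonical(typed_sakot_first))
```

This is why a rule that only looks at the character immediately before
the consonant would behave differently on equivalent strings.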

I'd like to add the example of Northern Thai Tai Tham ᩉ᩠ᨶᩮᩬᩥ᩠᩵ᨿ <U+1A49
HIGH HA, U+1A60 SAKOT, U+1A36 NA, U+1A6E SIGN E, U+1A6C SIGN OA BELOW,
U+1A65 SIGN I, U+1A75 TONE-1, U+1A60 SAKOT, U+1A3F LOW YA> /nɯai/ 'to
ache all over'.  At present that akshara is split into three grapheme
clusters, composed of 2, 6 and 1 characters. (Thai teaching splits it
into four logically contiguous groups of 3, 3, 1 and 2 characters for
onset, vowel, tone and final consonant.  I find ᩉ᩠ᨶ in native
abecedaries, and the other three all have names, namely mai kuea, mai
yak and hang ya.) When the change goes through, this will be just one
extended grapheme cluster of nine characters.
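
The counts above can be reproduced with a rough Python sketch. This is
emphatically not UAX #29: as simplifying assumptions of mine, any
character with general category M* is treated as extending, a virama is
anything with ccc=9, any other character after a virama is treated as a
linking consonant, and ZWJ, ZWNJ and Prepend are ignored. The function
names are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # Crude stand-in for gcb=Extend or SpacingMark.
    return unicodedata.category(ch).startswith("M")

def is_virama(ch):
    # Crude stand-in for the proposed Virama class.
    return unicodedata.combining(ch) == 9

def follows_virama(text, i):
    # True if text[i] is preceded by a virama plus zero or more other marks.
    j = i - 1
    while j >= 0 and is_mark(text[j]) and not is_virama(text[j]):
        j -= 1
    return j >= 0 and is_virama(text[j])

def clusters(text, join_across_virama):
    out = []
    for i, ch in enumerate(text):
        glue = is_mark(ch) or (join_across_virama and follows_virama(text, i))
        if out and glue:
            out[-1] += ch   # no break before this character
        else:
            out.append(ch)  # break: start a new cluster
    return out

# HIGH HA, SAKOT, NA, SIGN E, SIGN OA BELOW, SIGN I, TONE-1, SAKOT, LOW YA
word = "\u1A49\u1A60\u1A36\u1A6E\u1A6C\u1A65\u1A75\u1A60\u1A3F"

current = [len(c) for c in clusters(word, join_across_virama=False)]
proposed = [len(c) for c in clusters(word, join_across_virama=True)]
print(current, proposed)
```

Looking back over intervening marks in follows_virama() means the
result is the same whether the tone mark is stored before or after the
second SAKOT.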

Moving back to India, I suggest the Tamil example from
https://github.com/w3c/ilreq/issues/31#issuecomment-349589752, namely
யாவற்றையும் (yāvaṟṟaiyum), which currently has an extended grapheme
cluster for each consonant.

At a minimum, we need the Malayalam examples from the TUS.

Finally, I would recommend the Nepali example from
L2/11-370, श्रीमान्‌को, that I brought to the UTC's attention in
L2/17-122.  I hope someone else can deal with the other Devanagari
issues.  (Yep, even Devanagari needs more research!)

> 
> And yes, we do need review of these for Malayalam (+cibu, others).
> 
> If there are scripts for which the rules really don't work (or need
> more research before #29 is finalized in May), it is fairly
> straightforward to restrict the rule changes by modifying
> http://www.unicode.org/reports/tr29/proposed.html#Virama to either
> exclude particular scripts or include only particular scripts.
> 
> Mark <https://twitter.com/mark_e_davis>
> 
> On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:  
> 
> > On Sat, 9 Dec 2017 16:16:44 +0100
> > Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:
> >  
> > > 1. You make a good point about the GB9c. It should probably
> > > instead be something like:
> > >
> > > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
> > >
> > >
> > > Extend is broader than necessary, and there are a few items that
> > > have ccc!=0 but not gcb=extend. But all of those look to be
> > > degenerate cases.  
> >
> > Something *like*.
> >
> > Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA.  I
> > believe these both prevent a preceding candrakkala from extending
> > an akshara - see TUS Section 12.9 about Table 12-33.  I think
> > Extend will have to be split between starters and non-starters.
> >
> > I believe there is a problem with the first two examples in Table
> > 12-33.  If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E
> > MALAYALAM VOWEL SIGN AA> to the first two examples, yielding
> > *പാലു്കാ and *എ്ന്നാകാ, one would have three Malayalam aksharas,
> > not two extended grapheme clusters as the proposed rules would say.
> > This is different to Tai Tham, where there would indeed just be two
> > aksharas in each word, albeit odd-looking - ᨷᩤᩃᩩ᩠ᨠᩣ and ᩑ᩠ᨶ᩠ᨶᩣᨠᩣ.
> > Who's checking the impact of these changes on Malayalam?
> >
> > Richard.
> >
> >