Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mark Davis ☕️ via Unicode unicode at unicode.org
Mon Dec 11 01:59:20 CST 2017


The proposed rules do not distinguish the different visual forms that a
sequence of characters surrounding a virama can have, such as

   1. an explicit virama,
   2. a visible half-form, or
   3. a ligature.

That follows the structure requested in
http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf.

So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct Forms in
Devanagari <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G14632>)
doesn't break a GC, nor do instances where a particular script always shows
an explicit virama between two particular consonants. All the lines in Figure
12-7. Consonant Forms in Devanagari and Oriya
<http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G59257> that have a
virama would be single GCs (that is, all but the first line). [That is,
after correcting the rules as per Manish Goregaokar's feedback, thanks!]
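
To make the ZWNJ case concrete, here is a minimal Python sketch of the
GB9c wording quoted further down in this thread, (Virama | ZWJ) × Extend*
LinkingConsonant. It is not the implementation I mentioned, and the character
classes are assumptions for illustration only: Virama is just U+094D,
LinkingConsonant is just the Devanagari range U+0915..U+0939, and Extend is
approximated from General_Category plus ZWJ/ZWNJ rather than taken from the
real Grapheme_Cluster_Break data.

```python
import unicodedata

# Assumed, illustrative classes; the real ones come from the UCD.
VIRAMA = {"\u094D"}   # DEVANAGARI SIGN VIRAMA only
ZWJ = "\u200D"
ZWNJ = "\u200C"


def is_linking_consonant(ch):
    # Devanagari KA..HA, a stand-in for the proposed LinkingConsonant class.
    return "\u0915" <= ch <= "\u0939"


def is_extend(ch):
    # Rough approximation of Gcb=Extend (plus ZWJ): nonspacing and
    # enclosing marks, and the two joiners.
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch) in ("Mn", "Me")


def attaches(ch):
    # Simplified GB9/GB9a: never break before Extend, ZWJ or a spacing mark.
    return is_extend(ch) or unicodedata.category(ch) == "Mc"


def linked(text, i):
    # GB9c sketch: text[i] is a LinkingConsonant preceded by
    # (Virama | ZWJ) Extend*.
    if not is_linking_consonant(text[i]):
        return False
    j = i - 1
    # Skip back over Extend characters that are not themselves Virama or ZWJ.
    while j >= 0 and is_extend(text[j]) and text[j] not in VIRAMA and text[j] != ZWJ:
        j -= 1
    return j >= 0 and (text[j] in VIRAMA or text[j] == ZWJ)


def clusters(text):
    # Break before every character unless one of the rules above joins it
    # to the preceding cluster.
    out = []
    for i, ch in enumerate(text):
        if out and (attaches(ch) or linked(text, i)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


if __name__ == "__main__":
    for word in ("\u0915\u0937", "\u0915\u094D\u0937", "\u0915\u094D\u200C\u0937"):
        print(" ".join(f"U+{ord(c):04X}" for c in word), "->", len(clusters(word)), "cluster(s)")
```

Under these assumptions KA + VIRAMA + ZWNJ + SSA comes out as a single
cluster, the same as KA + VIRAMA + SSA, while KA + SSA gives two.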

The examples in "Annexure B" of 17200-text-seg-rec.pdf
<http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf> clearly include #2
and #3, but don't have any examples of #1 (as far as I can tell from a
quick scan). It would be very useful to have explicit examples that
include #1, and that include scripts other than Devanagari (+swaran,
others). While the online tool at http://unicode.org/cldr/utility/breaks.jsp
can't be used until the Unicode 11 UCD is further along, I have an
implementation of the new rules such that I can take any particular list of
words and generate the breaks. So if someone can supply examples from
different scripts or with different combinations of virama, ZWJ, ZWNJ, etc.,
I can push out the results to this list.

And yes, we do need review of these for Malayalam (+cibu, others).

If there are scripts for which the rules really don't work (or need more
research before UAX #29 is finalized in May), it is fairly straightforward to
restrict the rule changes by modifying
http://www.unicode.org/reports/tr29/proposed.html#Virama to either exclude
particular scripts or include only particular scripts.
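
As a rough illustration of what such a restriction could look like, the
Python sketch below gates a Virama-like class by script. The table, the
function name and its parameters are all hypothetical; the actual change
would be made in the property definition in the proposed UAX #29 text, not
in code like this.

```python
# Hypothetical table: a few virama-like code points and their scripts.
VIRAMA_SCRIPT = {
    "\u094D": "Devanagari",  # DEVANAGARI SIGN VIRAMA
    "\u0B4D": "Oriya",       # ORIYA SIGN VIRAMA
    "\u0D4D": "Malayalam",   # MALAYALAM SIGN VIRAMA
    "\u1A60": "Tai Tham",    # TAI THAM SIGN SAKOT
}


def is_linking_virama(ch, include=None, exclude=frozenset()):
    """Return True if ch should count as Virama for the new rule.

    include: if given, only these scripts take part in the rule.
    exclude: otherwise, every listed script is left out of the rule.
    """
    script = VIRAMA_SCRIPT.get(ch)
    if script is None:
        return False
    if include is not None:
        return script in include
    return script not in exclude
```

For example, is_linking_virama("\u0D4D", exclude={"Malayalam"}) is False
while the Devanagari virama still qualifies, and passing
include={"Devanagari"} limits the rule to that one script.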

Mark <https://twitter.com/mark_e_davis>

On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 9 Dec 2017 16:16:44 +0100
> Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:
>
> > 1. You make a good point about the GB9c. It should probably instead be
> > something like:
> >
> > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
> >
> >
> > Extend is broader than necessary, and there are a few items that
> > have ccc!=0 but not gcb=extend. But all of those look to be
> > degenerate cases.
>
> Something *like*.
>
> Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA.  I believe
> these both prevent a preceding candrakkala from extending an akshara -
> see TUS Section 12.9 about Table 12-33.  I think Extend will have to be
> split between starters and non-starters.
>
> I believe there is a problem with the first two examples in Table
> 12-33.  If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E MALAYALAM
> VOWEL SIGN AA> to the first two examples, yielding *പാലു്കാ and
> *എ്ന്നാകാ, one would have three Malayalam aksharas, not two extended
> grapheme clusters as the proposed rules would say. This is different to
> Tai Tham, where there would indeed just be two aksharas in each word,
> albeit odd-looking - ᨷᩤᩃᩩ᩠ᨠᩣ and ᩑ᩠ᨶ᩠ᨶᩣᨠᩣ.  Who's checking the impact of
> these changes on Malayalam?
>
> Richard.
>
>