Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Thu Dec 14 16:40:23 CST 2017

On Mon, 11 Dec 2017 21:45:23 +0000
Cibu Johny (സിബു) <cibu at google.com> wrote:

> I am assuming the purpose of the grapheme cluster definition is to be
> used for line spacing, vertical writing or cursor movement. Without
> defining the purpose, it is hard for me to say if a ruleset is valid
> or not. Assuming that purpose driven definition, we probably need
> language specific definitions - a pan-indic algorithm may not work.
> For instance, the proposed ruleset may not hold good for Tamil. For
> example, see the title in the following image: துக்ளக் broken as
> [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed
> algorithm it would be: [ta-u, ka-virama-lla, ka-virama]
> 
> http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg

I think Tamil is actually rather straightforward.  For native
intuition, I would cite the Tamil letter-counting account at
https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf.
What the author counts is not spacing glyphs, but vowel letters and
consonant characters, with two significant modifications: K.SSA
counts as just one consonant, and SH.R.II is also counted as
containing a single consonant.  In other words, the Tamil virama
character works as a pure killer except in those two environments.
This is also the story the TUNE protagonists tell us.  It will be an
inelegant rule for UAX#29, but, unfortunately, reality is messy.
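
The counting rule just described can be sketched in Python.  This is
only an illustration under stated assumptions, not the method of the
cited account: the name tamil_letters is hypothetical, combining marks
are recognised by their General_Category (Mn or Mc), and only the two
exceptions named above are treated as conjuncts, with SH.R.II spelled
using U+0BB6 SHA.

```python
import unicodedata

VIRAMA = "\u0bcd"                      # TAMIL SIGN VIRAMA (pulli)
K_SSA = "\u0b95" + VIRAMA + "\u0bb7"   # KA, VIRAMA, SSA
SH_R = "\u0bb6" + VIRAMA + "\u0bb0"    # SHA, VIRAMA, RA
SIGN_II = "\u0bc0"                     # TAMIL VOWEL SIGN II


def is_mark(ch):
    """True for combining marks such as vowel signs and the virama."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def tamil_letters(text):
    """Split text into counted 'letters' (hypothetical helper).

    The virama is treated as a pure killer, so it ends a letter,
    except inside K.SSA and SH.R.II, which are kept whole.
    """
    letters = []
    i = 0
    while i < len(text):
        if text.startswith(K_SSA, i):
            j = i + len(K_SSA)
        elif text.startswith(SH_R + SIGN_II, i):
            j = i + len(SH_R)      # the II sign is absorbed by the loop below
        else:
            j = i + 1
        # Attach any following combining marks to the current letter.
        while j < len(text) and is_mark(text[j]):
            j += 1
        letters.append(text[i:j])
        i = j
    return letters


# TA, SIGN U, KA, VIRAMA, LLA, KA, VIRAMA: the magazine title quoted above
thuglak = "\u0ba4\u0bc1\u0b95\u0bcd\u0bb3\u0b95\u0bcd"
print(tamil_letters(thuglak))
```

For the magazine title this yields four letters, [ta-u, ka-virama,
lla, ka-virama], which is the division seen in the image.  Characters
outside the Tamil block simply come out as letters of their own; a real
rule for UAX#29 would have to say more about them.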

> Malayalam could be a similar story. In case of Malayalam, it can be
> font specific because of the existence of traditional and reformed
> writing styles. A conjunct might be a ligature in traditional; and it
> might get displayed with explicit virama in the reformed style. For
> example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
> ta-aa, da-virama] - as it is written in the reformed style. As per
> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
> These breaks would be used by the traditional style of writing.

Working round that seems to be tricky.  The best I can think of is to
have two different locales, traditional and reformed, and hope that the
right font is selected.  It doesn't seem at all straightforward to
work out what the font is doing even from a character to glyph map
without knowing what the glyphs are.  I'm not sure how one should have
the difference designated - language variants, or two scripts?
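
To make the two-locale idea concrete, here is a minimal Python sketch
in which the only difference between the styles is whether a virama
joins the following consonant.  Everything named here is an assumption
for illustration: the style keys "traditional" and "reformed" are
hypothetical labels rather than registered locale or variant tags, and
the consonant test is a plain range check over U+0D15..U+0D39.  Chillus,
ZWJ and ZWNJ are ignored.

```python
import unicodedata

VIRAMA = "\u0d4d"  # MALAYALAM SIGN VIRAMA

# Hypothetical style table: does a virama join the next consonant?
JOIN_ACROSS_VIRAMA = {"traditional": True, "reformed": False}


def is_mark(ch):
    """True for combining marks such as vowel signs and the virama."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def is_consonant(ch):
    """Rough test: MALAYALAM LETTER KA..HA (U+0D15..U+0D39)."""
    return "\u0d15" <= ch <= "\u0d39"


def clusters(text, style):
    """Split text into clusters according to the chosen writing style."""
    join = JOIN_ACROSS_VIRAMA[style]
    out = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text):
            if is_mark(text[j]):
                j += 1          # marks always stay with their base
            elif join and text[j - 1] == VIRAMA and is_consonant(text[j]):
                j += 1          # traditional style: conjunct continues
            else:
                break
        out.append(text[i:j])
        i = j
    return out


# U, SA, VIRAMA, TA, SIGN AA, DA, VIRAMA: the word on the poster
ustad = "\u0d09\u0d38\u0d4d\u0d24\u0d3e\u0d26\u0d4d"
print(clusters(ustad, "traditional"))
print(clusters(ustad, "reformed"))
```

With "traditional" the word comes out as [u, sa-virama-ta-aa,
da-virama]; with "reformed" it comes out as [u, sa-virama, ta-aa,
da-virama].  The sketch says nothing about how the style would be
chosen, which is the hard part.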

> 
> https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg

> BTW, there is an example with explicit virama in the proposal under
> the Sanskrit section:

The alleged grapheme cluster is the last cluster of the second word in
the Sanskrit section of L2/17-200 Recommendations to UTC #152
on Text segmentation in Indian languages
(https://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf).  The
rendering seems odd if there is no ZWNJ in the word.  I read the word
as प्प्रप॑द्ये॒ pprapadye with two pitch accents.  However, I can't
explain the visible virama under the DA - even a Hindi font should have
a conjunct for D.YA.
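
For anyone wanting to check the reading, a small Python sketch that
lists the code points of the word as I have transcribed it and tests
for ZWNJ.  The escapes below are my transcription, not text extracted
from the PDF, so they are an assumption; describe is a hypothetical
helper name.

```python
import unicodedata

ZWNJ = "\u200c"                  # ZERO WIDTH NON-JOINER
ACCENTS = {"\u0951", "\u0952"}   # DEVANAGARI STRESS SIGN UDATTA / ANUDATTA

# Transcription of the word as read above (an assumption, see lead-in)
word = ("\u092a\u094d\u092a\u094d\u0930"   # PA, VIRAMA, PA, VIRAMA, RA
        "\u092a\u0951"                     # PA + first accent mark
        "\u0926\u094d\u092f\u0947\u0952")  # DA, VIRAMA, YA, SIGN E + accent


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


for code, name in describe(word):
    print(code, name)
print("contains ZWNJ:", ZWNJ in word)
print("pitch accents:", sum(ch in ACCENTS for ch in word))
```

Run against the actual text of the document, the same check would show
whether a ZWNJ after the virama accounts for the rendering.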

Richard.