Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues
Richard Wordingham via Unicode
unicode at unicode.org
Mon Dec 11 19:25:04 CST 2017
On Mon, 11 Dec 2017 21:45:23 +0000
Cibu Johny (സിബു) <cibu at google.com> wrote:
> I am assuming the purpose of the grapheme cluster definition is to be
> used for line spacing, vertical writing or cursor movement. Without
> defining the purpose, it is hard for me to say if a ruleset is valid
> or not.
That is a very fair point. Take the example of Thai, an Indic
script which isn't affected by the proposal. There, the spacing
vowel signs, whether before or after, may undergo greater separation
when text is stretched to fill a space. I've seen great separation on
hoardings. The spacing vowel signs are given gc=Lo. Vertical writing
examples are fairly rare, but I've seen 'Yamaha' written vertically in
three horizontal stretches - ยา มา หา. Also, 'video' may be written
vertically in three horizontal stretches, as V D O or as วิ ดี โอ. I'm
not absolutely sure I've seen the latter in Thai script, but Glenn Slayden
reports it at
The striking thing is that four of these syllables have spacing vowels,
which would be written on their own in writing stretched horizontally,
but associate with the consonant in vertical writing.
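
The gc values mentioned above can be checked against the Unicode Character Database that ships with Python. This is only a minimal sketch: the helper names (vowel_categories, has_spacing_vowel) are made up for illustration, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# The six syllables from the vertical-writing examples above.
SYLLABLES = [
    "\u0e22\u0e32",  # ยา
    "\u0e21\u0e32",  # มา
    "\u0e2b\u0e32",  # หา
    "\u0e27\u0e34",  # วิ
    "\u0e14\u0e35",  # ดี
    "\u0e42\u0e2d",  # โอ
]


def vowel_categories(syllable):
    """Return (name, general category) for each SARA character in the syllable."""
    return [
        (unicodedata.name(ch), unicodedata.category(ch))
        for ch in syllable
        if "SARA" in unicodedata.name(ch)
    ]


def has_spacing_vowel(syllable):
    """True if any vowel sign in the syllable has gc=Lo rather than gc=Mn."""
    return any(cat == "Lo" for _, cat in vowel_categories(syllable))


for s in SYLLABLES:
    print(s, vowel_categories(s))

spacing = [s for s in SYLLABLES if has_spacing_vowel(s)]
print(len(spacing), "of", len(SYLLABLES), "syllables have a gc=Lo vowel sign")
```

SARA AA and SARA O come out as gc=Lo, while SARA I and SARA II are gc=Mn, so the default grapheme cluster rules keep only the latter two attached to their consonants.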
I haven't checked on the software-free behaviour of U+0E33 THAI
CHARACTER SARA AM, which is historically a combination of a
mark above and a mark to the right. The Royal Institute Dictionary of
1999 resolves it into NIKHAHIT and SARA AA for what is a very slight
horizontal spacing (e.g. the entry for กระบาล), but I have seen the
NIKHAHIT component still attached to the SARA AA component. However, I
don't know how much control the RID had over the typesetting of the
I think making the proposed change and still saying that cursor motion
should follow the extended grapheme cluster boundaries is contrary to
the Equality Act 2010. It would be knowingly making text editing
harder for the users of most Indic scripts. Those writing a Tai
language in the Tai Tham script would be hit hardest, even if one
mapped compound vowels to simple key stroke sequences.
> Assuming that purpose driven definition, we probably need
> language specific definitions - a pan-indic algorithm may not work.
There is the intermediate level of script-specific definitions. We
already have them - following spacing marks are generally excluded from
the grapheme clusters in the Burmic scripts.
> For instance, the proposed ruleset, may not hold good for Tamil. For
> example, see the title in the following image: துக்ளக் broken as
> [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed
> algorithm it would be: [ta-u, ka-virama-lla, ka-virama]
> [image: image.png]
Thank you for the example. I think the rule for the Tamil script should
be that pulli attaches a following consonant to its grapheme cluster
only in the case of the sequences க்ஷ and ஶ்ரீ, but as I typed the
latter, I was surprised to see the sequence ஶ்ர adopt a conjunct shape,
so I don't know whether I'm seeing variation or a font error.
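
The difference between the two rules can be sketched in a few lines of Python. This is a deliberately simplified model, not the UAX#29 algorithm: it treats a cluster as a base character plus any following marks (gc=M*), and the names tamil_clusters, JOINING_SEQUENCES and join_all are hypothetical. With join_all=True it imitates the proposed behaviour (pulli always attaches a following letter); by default it applies the restricted rule suggested above.

```python
import unicodedata

PULLI = "\u0bcd"  # TAMIL SIGN VIRAMA

# Sequences in which pulli is assumed to attach the following consonant.
JOINING_SEQUENCES = (
    "\u0b95\u0bcd\u0bb7",        # க்ஷ (ka, pulli, ssa)
    "\u0bb6\u0bcd\u0bb0\u0bc0",  # ஶ்ரீ (sha, pulli, ra, ii)
)


def is_mark(ch):
    """True for combining marks, spacing or non-spacing (gc=Mc, Mn, Me)."""
    return unicodedata.category(ch).startswith("M")


def tamil_clusters(text, join_all=False):
    """Split text into simplified clusters: a base plus its following marks.

    join_all=False: pulli attaches a following consonant only inside one of
                    JOINING_SEQUENCES.
    join_all=True:  pulli attaches any following letter, as in the proposal.
    """
    clusters = []
    for i, ch in enumerate(text):
        after_pulli = i >= 2 and text[i - 1] == PULLI and not is_mark(ch)
        if join_all:
            joins = after_pulli
        else:
            joins = after_pulli and any(
                text.startswith(seq, i - 2) for seq in JOINING_SEQUENCES
            )
        if clusters and (is_mark(ch) or joins):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u0ba4\u0bc1\u0b95\u0bcd\u0bb3\u0b95\u0bcd"  # துக்ளக்
print(tamil_clusters(word))                 # restricted rule
print(tamil_clusters(word, join_all=True))  # proposed rule
```

For துக்ளக் the restricted rule yields the four pieces seen in the image, [ta-u, ka-virama, lla, ka-virama], while join_all=True yields the three pieces [ta-u, ka-virama-lla, ka-virama] that the proposed algorithm would produce.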
> Malayalam could be a similar story. In case of Malayalam, it can be
> font specific because of the existence of traditional and reformed
> writing styles. A conjunct might be a ligature in traditional; and it
> might get displayed with explicit virama in the reformed style. For
> example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
> ta-aa, da-virama] - as it is written in the reformed style. As per
> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
> These breaks would be used by the traditional style of writing.
It seems that the words of UAX#29 have been forgotten - "So tailorings for
aksaras may need to be script-, language-, font-, or context-specific
to be useful". The big problem is that virama leaves too much up to