Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Manish Goregaokar via Unicode unicode at
Fri Dec 22 00:04:37 CST 2017

> When deleting by backspace, the usual practice is to delete one Unicode
character for each key press.

This seems to depend on the operating system and program involved. For
example, on OSX any native text input field (Spotlight, TextEdit, etc) will
delete by extended grapheme cluster. Chrome also deletes by extended
grapheme cluster.

However, Firefox deletes by code point. Or, more accurately, something
codepoint-like. Backspace will delete flag emoji wholesale, but will delete
the jamos in `ᄀᄀᄀ각ᆨᆨ` (a single EGC) one at a time. It also deletes the
variation selector and the heart in a couple-with-heart ZWJ sequence
(person, ZWJ, ❤️, ZWJ, person) in a single keystroke.
There's probably a simple metric being used here, but I haven't looked into
it yet.
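
To make the two policies concrete, here is a minimal Python sketch of "delete one code point" versus "delete one cluster". It is not Firefox's, Chrome's, or OSX's actual logic, and the cluster version only handles two simplified rules (trailing combining marks and regional-indicator pairs), not the full UAX #29 algorithm; the function names are made up for illustration.

```python
import unicodedata


def is_regional_indicator(ch):
    # Regional indicator symbols; a pair of them renders as one flag.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def backspace_code_point(text):
    """Delete exactly one code point."""
    return text[:-1]


def backspace_cluster(text):
    """Delete the last cluster under two simplified rules (NOT full UAX #29)."""
    if not text:
        return text
    i = len(text) - 1
    # Rule 1: trailing combining marks stay attached to their base.
    while i > 0 and unicodedata.category(text[i]).startswith("M"):
        i -= 1
    # Rule 2: regional indicators pair up from the start of their run.
    if is_regional_indicator(text[i]):
        j, run = i, 0
        while j >= 0 and is_regional_indicator(text[j]):
            run += 1
            j -= 1
        if run % 2 == 0:
            i -= 1
    return text[:i]


flag = "\U0001F1EF\U0001F1F5"  # regional indicators J + P
accented = "e\u0301"           # e + COMBINING ACUTE ACCENT

print(len(backspace_code_point("a" + flag)))  # 2: half of the flag is left
print(len(backspace_cluster("a" + flag)))     # 1: the whole flag is gone
print(repr(backspace_code_point(accented)))   # 'e'
print(repr(backspace_cluster(accented)))      # ''
```

Hangul jamo sequences and ZWJ emoji sequences are deliberately left out, so this sketch would split both of the examples above.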


Overall it seems like different scripts have different preferences for
forming clusters. Perhaps we should have a specific "cluster-forming
virama" category for viramas from scripts that almost always prefer
clusters (e.g. Devanagari)? IIRC some Indic scripts prefer explicit viramas.
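
As a rough illustration of what such a category could mean for segmentation, here is a Python sketch in which the set of "cluster-forming" viramas is just a parameter. This is my own simplification, not the proposed algorithm or UAX #29: marks attach to the preceding cluster, and a letter attaches only if it directly follows a virama in the given set. The name `linking_viramas` is hypothetical.

```python
import unicodedata


def clusters(text, linking_viramas=frozenset()):
    """Split text into simplified clusters.

    linking_viramas: virama code points that glue the following
    letter onto the current cluster (a hypothetical category).
    """
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_after_virama = (
            bool(out)
            and out[-1][-1] in linking_viramas
            and unicodedata.category(ch) == "Lo"
        )
        if out and (is_mark or joins_after_virama):
            out[-1] += ch
        else:
            out.append(ch)
    return out


MALAYALAM_VIRAMA = "\u0D4D"
# u, sa, virama, ta, aa, da, virama (the Malayalam word quoted below)
word = "\u0D09\u0D38\u0D4D\u0D24\u0D3E\u0D26\u0D4D"

print(len(clusters(word)))                      # 4: u | sa-virama | ta-aa | da-virama
print(len(clusters(word, {MALAYALAM_VIRAMA})))  # 3: u | sa-virama-ta-aa | da-virama
```

With an empty set the virama never links, which gives the reformed-style breaks; adding the Malayalam virama to the set gives the traditional-style ones. A real property would presumably be assigned per virama character rather than passed in by the caller.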


On Thu, Dec 21, 2017 at 1:44 PM, Richard Wordingham via Unicode <
unicode at> wrote:

> On Thu, 21 Dec 2017 17:55:33 +0900
> "Martin J. Dürst via Unicode" <unicode at> wrote:
> > On 2017/12/15 07:40, Richard Wordingham via Unicode wrote:
> > > On Mon, 11 Dec 2017 21:45:23 +0000
> > > Cibu Johny (സിബു) <cibu at> wrote:
> > >> For example see the poster with word ഉസ്താദ് broken as [u,
> > >> sa-virama, ta-aa, da-virama] - as it is written in the reformed
> > >> style. As per the proposed algorithm, it would be [u,
> > >> sa-virama-ta-aa, da-virama]. These breaks would be used by the
> > >> traditional style of writing.
> > I'm not at all familiar with Malayalam, but my experience with
> > typing Japanese (where the average kana character requires two
> > keystrokes for input, but only one for deleting) would lead to
> > different advice. When typing, it is very helpful to know how many
> > times one has to hit backspace when making an error. This kind of
> > knowledge is usually assimilated into what one calls muscle memory,
> > i.e. it is done without thinking about it. I would guess that it would
> > be very difficult to maintain two different kinds of muscle memory
> > for typing Malayalam. (My assumption is that the populations typing
> > traditional and reformed writing styles are not disjoint.)
> When deleting by backspace, the usual practice is to delete one Unicode
> character for each key press.  The proposed change to the definition of
> grapheme clusters will not affect this.
> What will change, for some systems, is stepping through Indic text in
> most scripts. (The visual order scripts will be unaffected.)  In Linux
> applications, one can often step to the start of each grapheme cluster,
> i.e. to the breaks in |u|sa-virama|ta-aa|da-virama|. If the proposal to
> expand extended grapheme clusters to whole aksharas goes through, a
> likely effect for traditional Malayalam is that one will only be able to
> step to the positions marked as breaks in
> |u|sa-virama-ta-aa|da-virama|.  Every major system will then be in the
> same position as Windows, where already only the reduced set of cursor
> positions is allowed.  Thus if the 'sa' were mistyped, one would have
> to retype the entire 4-character akshara.  I find this an unpleasant
> prospect, and some Indians already find it extremely annoying not to be
> able to edit the join between consonants, e.g. to replace <virama> by
> <virama, ZWJ>.
> Richard.