Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

Martin J. Dürst via Unicode unicode at
Tue Oct 22 18:15:57 CDT 2019

Hello Richard, others,

On 2019/10/23 07:32, Richard Wordingham via Unicode wrote:
> On Tue, 22 Oct 2019 23:27:27 +0200
> Daniel Bünzli via Unicode <unicode at> wrote:

>> Just to make things clear. When you say character in your message,
>> you consistently mean scalar value right ?
> Yes.
> I find it hard to imagine that having to type them doesn't endow them
> with some sort of reality in the users' minds, though some, such as
> invisible stackers, are probably envisaged as control characters.

I think this is to some extent a question of "reality in the users' 
minds". But to a very large extent, it is an issue of muscle memory. 
If a user works with a keyboard/input method that deletes a whole 
combination, their muscles will get used to that, just as they would 
get used to one that deletes a single code point at a time.
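
To make the two backspace behaviours concrete, here is a minimal Python
sketch. The function names are made up for illustration, and the
"whole combination" variant only handles a base character followed by
combining marks; it is an approximation, not full UAX #29 grapheme
cluster segmentation.

```python
import unicodedata


def backspace_code_point(text: str) -> str:
    """Delete only the last code point (scalar value)."""
    return text[:-1]


def backspace_combination(text: str) -> str:
    """Delete the last base character together with its combining marks.

    Simplifying assumption: a "combination" is one base character
    followed by characters with a nonzero canonical combining class.
    """
    i = len(text)
    # Walk backwards over any trailing combining marks.
    while i > 0 and unicodedata.combining(text[i - 1]) != 0:
        i -= 1
    # Then drop the base character as well, if there is one.
    return text[:max(i - 1, 0)]


sample = "ab" + "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
after_code_point = backspace_code_point(sample)    # the bare "e" remains
after_combination = backspace_combination(sample)  # "e" and accent both go
```

Either behaviour is easy to implement; which one feels natural is the
muscle-memory question above.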

Users are perfectly capable of talking about characters and, in the same 
sentence, using that word once for something like individual codepoints 
and later for a whole combination.

> One does come across some odd entry methods, such as typing an Indic
> akshara using the Latin script and then entering it as a whole.  That
> is no more conducive to seeing the constituents as characters than is
> typing wab- to get the hieroglyph ��.

The input of Japanese Kana is usually done from a Latin keyboard. As an 
example, to input the syllable "ka" (か), one presses the keys for 'k' 
and 'a'. In all the IMEs I have used, a backspace deletes the whole
"か", not only the 'a'. It takes some getting used to (I still 
occasionally want to press backspace twice when I realize I made a 
typo), but one does get used to it.
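
The behaviour described here can be sketched as follows. This is a toy
model, not how any particular IME is implemented: the class name, the
method names, and the tiny conversion table are all assumptions made
for illustration.

```python
# Tiny illustrative table; a real IME covers far more syllables.
ROMAJI_TO_KANA = {"a": "あ", "ka": "か", "ki": "き", "na": "な"}


class KanaComposer:
    """Toy romaji-to-kana composer (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.committed = ""  # kana produced so far
        self.pending = ""    # Latin keys not yet converted

    def press(self, key: str) -> None:
        self.pending += key
        kana = ROMAJI_TO_KANA.get(self.pending)
        if kana is not None:
            # The pending keys form a full syllable: replace them by kana.
            self.committed += kana
            self.pending = ""

    def backspace(self) -> None:
        if self.pending:
            # Unconverted Latin letters are deleted one at a time.
            self.pending = self.pending[:-1]
        else:
            # After conversion, the whole kana goes, not only the 'a'.
            self.committed = self.committed[:-1]

    def display(self) -> str:
        return self.committed + self.pending


ime = KanaComposer()
ime.press("k")
ime.press("a")       # display is now "か"
ime.backspace()      # display is now empty: both key presses are undone
```

Two key presses went in, but a single backspace removes their result,
which is exactly what the fingers have to learn.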

There are also cases such as "kya" → "きゃ", where the three Latin 
keyboard presses cannot be allocated 2-1 or 1-2 to the two resulting 
Hiragana. In a sophisticated implementation, a backspace could go from
"きゃ" to "ky", but that would only work immediately after input.
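
A rough sketch of such a "sophisticated" backspace, again in Python and
again purely hypothetical (the class, its fields, and the three-entry
table are my own illustration): the composer remembers the most recent
conversion, and forgets it as soon as anything else happens.

```python
# Tiny illustrative table; a real IME covers far more syllables.
ROMAJI_TO_KANA = {"ka": "か", "ki": "き", "kya": "きゃ"}


class UndoableKanaComposer:
    """Toy composer whose backspace can undo the latest conversion."""

    def __init__(self) -> None:
        self.committed = ""          # kana produced so far
        self.pending = ""            # Latin keys not yet converted
        self.last_conversion = None  # (romaji, kana), set only right after converting

    def press(self, key: str) -> None:
        self.pending += key
        kana = ROMAJI_TO_KANA.get(self.pending)
        if kana is not None:
            self.committed += kana
            self.last_conversion = (self.pending, kana)
            self.pending = ""
        else:
            # Any other key press means the undo window has closed.
            self.last_conversion = None

    def backspace(self) -> None:
        if self.pending:
            self.pending = self.pending[:-1]
        elif self.last_conversion is not None:
            # Immediately after input: go from "きゃ" back to "ky".
            romaji, kana = self.last_conversion
            self.committed = self.committed[:-len(kana)]
            self.pending = romaji[:-1]
        elif self.committed:
            # Otherwise only one kana can be removed.
            self.committed = self.committed[:-1]
        self.last_conversion = None

    def display(self) -> str:
        return self.committed + self.pending


ime = UndoableKanaComposer()
for key in "kya":
    ime.press(key)   # display is now "きゃ"
ime.backspace()      # display is now "ky"
```

Once another key has been pressed, the romaji that produced "きゃ" is
no longer known, so backspace can only remove one kana at a time.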

Of course, for Japanese input, Latin → Kana is only the first layer; the 
second layer is Kana → Kanji.

Regards,   Martin.
