Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

Tue Oct 22 15:44:10 CDT 2019

On Tue, 22 Oct 2019 11:04:01 +0200
Daniel Bünzli via Unicode <unicode at unicode.org> wrote:

> On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode
> (unicode at unicode.org) wrote:
> 
> > When it comes to the second sentence of the text of Slide 7
> > 'Grapheme Clusters', my overwhelming reaction is one of extreme
> > anger. Slide 8 does nothing to lessen the offence. The problem is
> > that it gives the impression that in general it is acceptable for
> > backspace to delete the whole grapheme cluster.  
> 
> Let's turn extreme anger into knowledge. 
> 
> I'm not very knowledgeable in ligature-heavy scripts (I suspect that's
> what you refer to) and what you describe is the first thing I went
> with for a readline editor data structure. 

Not necessarily ligature-heavy, but heavy in combining characters.
Examples at the light end include IPA and pointed Hebrew.  The Thai
script is another fairly well-known one but Siamese itself doesn't use
more than two marks on a consonant.  (The vowel marks before and after
don't count - they work like letters.)

> Would you maybe care to expand on when exactly you think it's not
> acceptable, and what kind of tools or standards I can find in the
> Unicode toolbox to implement an acceptable behaviour for backspace on
> general Unicode text?

The compromise that has generally been reached is that 'delete' deletes
a grapheme cluster and 'backspace' deletes a scalar value.  (There are
good editors like Emacs that delete only a single character.)  The
rationale for this is that backspace undoes the effect of a
keystroke. For a perfect match, the keyboard would need to handle the
backspace - and everyone editing the text would have to use compatible
keyboards!  That's not a very plausible scenario for a Wikipedia
article.
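
To make the distinction concrete, here is a minimal Python sketch of the
two operations. It is only an illustration: the function names are my own,
and "base character plus following combining marks" is a crude stand-in
for a grapheme cluster, not the full UAX #29 segmentation rules.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def backspace(text, cursor):
    """Remove the single scalar value immediately before the cursor."""
    if cursor == 0:
        return text, 0
    return text[:cursor - 1] + text[cursor:], cursor - 1


def delete_forward(text, cursor):
    """Remove the character at the cursor plus any combining marks after it.

    ASSUMPTION: this approximates a grapheme cluster as one character
    followed by marks; a real implementation would follow UAX #29.
    """
    if cursor >= len(text):
        return text, cursor
    end = cursor + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return text[:cursor] + text[end:], cursor
```

With the text 'e' + U+0301 + U+0323 + 'x', backspace with the cursor just
before the 'x' removes only U+0323, while delete with the cursor at the
start removes the 'e' and both marks, leaving just the 'x'.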

Now, deleting the last character is not very Unicode compliant; there
is a family of keyboard designs in development that by default deletes
the last character in NFC form if it is precomposed and otherwise the
last character in NFD form.  UTS#35 Issue 36 Part 7 Section 5.21
allows for more elaborate behaviours.  I would contend that deleting
the last character is the best simple approximation.  However, it's not
impossible for a dead key implementation to decide that dead acute plus
'e' should be emitted as two characters, even though it's more usual for
it to be emitted as a single character.
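
One possible reading of that NFC/NFD default, sketched in Python with the
standard unicodedata module. The function name is hypothetical, and
normalising the whole buffer (and returning it in NFC) is a simplification
of mine; I am not claiming this is how any of those keyboards implements it.

```python
import unicodedata


def backspace_nfc_then_nfd(text):
    """Delete one 'character' from the end of text.

    ASSUMED interpretation: if the last character of the NFC form has a
    canonical decomposition (i.e. it is precomposed), delete it whole;
    otherwise delete the last character of the NFD form.
    """
    if not text:
        return text
    nfc = unicodedata.normalize("NFC", text)
    last = nfc[-1]
    if unicodedata.normalize("NFD", last) != last:
        # The last NFC character is precomposed: remove it as a unit.
        return nfc[:-1]
    # Otherwise remove the last character of the decomposed form.
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", nfd[:-1])
```

Under this reading, 'e' followed by U+0301 is deleted as a unit, because it
composes to U+00E9, whereas 'e' + U+0323 + U+0301 loses only the acute,
because no single precomposed character covers all three.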

Now, there are cases where one may be unlikely to type a single
character.  I can imagine a variation sequence being implemented as
a 'ligature', i.e. a single stroke (or IME selection action) yielding
the entry of a base character plus variation selector.  Emoji may be
another, though I must say I would probably enter a regional indicator
pair as two characters, and expect to be able to delete just the last
if I made an error, contra Davis 2019.
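
For the regional indicator case, the two behaviours can be compared with
a small Python sketch. The names are mine, and the pairing logic only
mimics the way UAX #29 groups regional indicators in pairs from the start
of a run; it is not a general grapheme cluster implementation.

```python
def is_regional_indicator(ch):
    """True for U+1F1E6..U+1F1FF (REGIONAL INDICATOR SYMBOL LETTER A..Z)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def backspace_scalar(text):
    """The behaviour I would expect: drop only the last scalar value."""
    return text[:-1]


def backspace_flag_pair(text):
    """Cluster-style behaviour: drop a complete trailing pair as a unit."""
    run = 0
    # Count the regional indicators at the end of the text.
    while run < len(text) and is_regional_indicator(text[-1 - run]):
        run += 1
    # Pairs are formed from the start of the run, so the last cluster is
    # a full pair only when the run length is even.
    if run and run % 2 == 0:
        return text[:-2]
    return text[:-1]
```

After typing U+1F1EB U+1F1F7 (the pair for FR), backspace_scalar leaves
U+1F1EB ready for a corrected second letter, whereas backspace_flag_pair
removes both and the first has to be entered again.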

While stacker + consonant might be expected to be a unit, the original
designs envisaged them being a sequence.  Additionally, I would expect
an edit to change the subscripted consonant rather than remove it.  In
this case, delete last character and delete grapheme cluster agree for
the language-independent rules.

Richard.