Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)
Richard Wordingham via Unicode
unicode at unicode.org
Tue Oct 22 15:44:10 CDT 2019
On Tue, 22 Oct 2019 11:04:01 +0200
Daniel Bünzli via Unicode <unicode at unicode.org> wrote:
> On 22 October 2019 at 09:37:22, Richard Wordingham via Unicode
> (unicode at unicode.org) wrote:
> > When it comes to the second sentence of the text of Slide 7
> > 'Grapheme Clusters', my overwhelming reaction is one of extreme
> > anger. Slide 8 does nothing to lessen the offence. The problem is
> > that it gives the impression that in general it is acceptable for
> > backspace to delete the whole grapheme cluster.
> Let's turn extreme anger into knowledge.
> I'm not very knowledgeable in ligature-heavy scripts (I suspect that's
> what you refer to) and what you describe is the first thing I went
> with for a readline editor data structure.
Not necessarily ligature-heavy, but heavy in combining characters.
Examples at the light end include IPA and pointed Hebrew. The Thai
script is another fairly well-known one but Siamese itself doesn't use
more than two marks on a consonant. (The vowel marks before and after
don't count - they work like letters.)
> Would you maybe care to expand when exactly you think it's not acceptable
> and what kind of tools or standard I can find in the Unicode toolbox to
> implement an acceptable behaviour for backspace on general Unicode
The compromise that has generally been reached is that 'delete' deletes
a grapheme cluster and 'backspace' deletes a scalar value. (There are
good editors like Emacs that delete only a single character.) The
rationale for this is that backspace undoes the effect of a
keystroke. For a perfect match, the keyboard would need to handle the
backspace - and everyone editing the text would have to use compatible
keyboards! That's not a very plausible scenario for a Wikipedia
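
The split between the two keys can be sketched in Python, whose str type
is indexed by code point. This is only an illustration under stated
assumptions: the function names are made up, and the cluster test (a
base followed by characters of general category M*) is a crude stand-in
for the UAX #29 rules, not an implementation of them.

```python
import unicodedata


def backspace(text, cursor):
    """Remove the single code point before the cursor (hypothetical helper)."""
    if cursor == 0:
        return text, 0
    return text[:cursor - 1] + text[cursor:], cursor - 1


def _is_mark(ch):
    # General categories Mn, Mc and Me; an approximation of "extends a cluster".
    return unicodedata.category(ch).startswith("M")


def delete_forward(text, cursor):
    """Remove the base at the cursor plus any marks that follow it."""
    if cursor >= len(text):
        return text, cursor
    end = cursor + 1
    while end < len(text) and _is_mark(text[end]):
        end += 1
    return text[:cursor] + text[end:], cursor
```

With the text 'e' + U+0301 and the cursor at the end, backspace() leaves
the bare 'e'; with the cursor at the start, delete_forward() removes both
code points.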
Now, deleting the last character is not very Unicode compliant; there
is a family of keyboard designs in development that by default deletes
the last character in NFC form if it is precomposed and otherwise the
last character in NFD form. UTS#35 Issue 36 Part 7 Section 5.21
allows for more elaborate behaviours. I would contend that deleting
the last character is the best simple approximation. However, it's not
impossible for a dead key implementation to decide that dead acute plus
'e' should be emitted as two characters, even though it's more usual for
it to be emitted as a single character.
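
One possible reading of the NFC/NFD default described above, sketched in
Python. The function name is hypothetical, and normalising the whole
buffer is a simplification made here for brevity; it is not a claim
about how those keyboard designs actually work.

```python
import unicodedata


def backspace_nfc_nfd(text):
    """Delete the last NFC character if precomposed, else the last NFD one."""
    if not text:
        return text
    nfc = unicodedata.normalize("NFC", text)
    last = nfc[-1]
    # A character counts as precomposed here if NFD changes it.
    if unicodedata.normalize("NFD", last) != last:
        return nfc[:-1]
    nfd = unicodedata.normalize("NFD", text)
    return nfd[:-1]
```

Under this reading, 'e' + U+0301 composes to U+00E9 and vanishes in one
keystroke, whereas 'x' + U+0301 has no precomposed form and loses only
the accent. The two forms can disagree about what is last: 'c' + U+0323
+ U+0301 is U+0107 + U+0323 in NFC, but the sketch removes U+0301, the
last character of the NFD form.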
Now, there are cases where one may be unlikely to type a single
character. I can imagine a variation sequence being implemented as
a 'ligature', i.e. a single stroke (or IME selection action) yielding
the entry of a base character plus variation selector. Emoji may be
another, though I must say I would probably enter a regional indicator
pair as two characters, and expect to be able to delete just the last
if I made an error, contra Davis 2019.
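
The regional indicator case is small enough to show directly. A minimal
sketch, assuming backspace removes one code point; the variable and
function names are illustrative only.

```python
# A flag is a pair of regional indicator symbols: here F + R.
FLAG_FR = "\U0001F1EB\U0001F1F7"


def backspace_scalar(text):
    """Remove the last code point only (hypothetical helper)."""
    return text[:-1]


# Mistyped the second letter: remove it and enter regional indicator I.
half = backspace_scalar(FLAG_FR)
corrected = half + "\U0001F1EE"
```

Here half still holds the first indicator, so the correction costs two
keystrokes rather than three.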
While stacker + consonant might be expected to be a unit, the original
designs envisaged them being a sequence. Additionally, I would expect
an edit to change the subscripted consonant rather than remove it. In
this case, delete last character and delete grapheme cluster agree for
the language-independent rules.