Deleting Lone Surrogates

Richard Wordingham richard.wordingham at
Sun Oct 4 19:24:40 CDT 2015

On Sun, 4 Oct 2015 15:34:13 -0700
"Asmus Freytag (t)" <asmus-inc at> wrote:

> On 10/4/2015 2:35 PM, Richard Wordingham wrote:

>> I'd much prefer to be able to delete the first character of a
>> grapheme cluster.  It's annoying to have to retype 4 characters
>> because one's mistyped the first of the 4 characters in a grapheme
>> cluster.  Removing the restriction would be much more useful.

> That makes sense for common typos, less so for (hopefully) uncommon
> data corruption.

Allowing access within the cluster is generally useful.  Providing more
access just makes it easier to repair things.  One problem is that
there isn't a 'suspend shaping' option to allow one to see what one is
doing.  This matters when canonical combining classes are not available
to sort out the ordering of components.
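
To make the difference concrete, here is a minimal sketch of the two
editing behaviours: deleting a whole cluster versus replacing only its
first code point.  The function names are hypothetical, and the
segmentation is a deliberate simplification (a base character plus any
following characters of general category M), not the full UAX #29
grapheme cluster rules that a real editor would use.

```python
import unicodedata


def simple_clusters(text):
    """Approximate clusters: a base character plus trailing marks (category M*).

    Assumption: this is a simplification, not UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def delete_whole_cluster(text, index):
    """Cluster-level deletion: base and all of its marks go together."""
    clusters = simple_clusters(text)
    del clusters[index]
    return "".join(clusters)


def replace_first_code_point(text, index, new_first):
    """Code-point-level repair: swap the first code point, keep the rest."""
    clusters = simple_clusters(text)
    clusters[index] = new_first + clusters[index][1:]
    return "".join(clusters)


# KO KAI + SARA II + MAI EK, followed by NO NU
word = "\u0E01\u0E35\u0E48\u0E19"
repaired = replace_first_code_point(word, 0, "\u0E02")  # KO KAI -> KHO KHAI
remainder = delete_whole_cluster(word, 0)
```

With cluster-level deletion the vowel and tone mark are lost along with
the mistyped consonant and have to be retyped; with code-point-level
access only the consonant changes.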

> For some languages, you'll be typing several keystrokes, even if it's
> a single code point; there seems to be limited desire to allow you to
> "edit" the keystrokes.

The creators of the application do not know how many keystrokes were
used.  A multi-platform application is not likely to take note of what
keys were pressed even when this information is available.

> For other languages I would expect a UI design
> to cater to what local custom prefers.

Local custom?

'Local custom' is usually one of the following:

a) pen and ink, possibly with scraper.

b) typewriter and Tipp-Ex

c) Hacked ASCII (and similar)

Only with complex ligatures would you not have access to each

The only parallels to what happens now that I can think of that might
count as 'custom' are:

1) European 8-bit codes, where letter plus diacritic is treated as a

2) Korean, where one couldn't chop and change the individual jamo.

3) Thai, where a tone mark can severely restrict what scraping can do.

A UI design might respond to loud enough howls of user protest.  You
may recall Thai howls of protest when the ability to independently
delete preposed vowels was lost.  Thai may have some complex vowel
symbols, but as far as the grapheme clusters go, *Thai* doesn't get more
complicated than CVT (consonant, vowel (just one!) and tone).  Some of
the minority languages in the Thai script might be a bit more

I do recall SIL's split cursor, which attempted to address the
difficulties of navigating through a stack of diacritics.  I miss it,
even though I never got to grips with all its subtleties.

What I believe is much more the case is that Unicode encourages 'one
size fits all'.  There are massive *translation* efforts for user
interfaces.  As to other parts of the text input/output, they are
usually separate from the applications.  The keyboard is almost totally
independent of the application.  Fonts are restricted to attempts to
provide adequate coverage, but the ideal is that the user provides his
own.  I think the LibreOffice search and replace interface says a lot.
It has visible support for Japanese - they holler and may well add
their own support into the core project - and there are some CTL
options which make best sense from the point of view of the Arabic
script.  The limitations on editing are one of the few places where the
UI is under the tight control of the programmers.  By and large, they
seem to be influenced by a few sources, such as the Unicode technical

Refutation awaited.

Now an attitude of 'one size fits all' does get things done.  It might
be a bit rough, but it's a lot better than nothing.

