Deleting Lone Surrogates

Richard Wordingham richard.wordingham at
Sun Oct 4 18:14:13 CDT 2015

On Sun, 4 Oct 2015 14:29:16 -0700
"Asmus Freytag (t)" <asmus-inc at> wrote:

> On 10/4/2015 12:38 PM, Richard Wordingham wrote:

> The problem you are trying to solve is to allow editing on
> the code point level, or, if you will, the keystroke level.

> Generally, there will be a sweet spot for each language (and each
> user) with respect to what to erase or undo.

> For sequences that belong to a given language, you can pick the
> behavior that makes most sense in them, but for lone surrogates, by
> definition you are dealing with broken text that doesn't follow any
> conventions.

Who's 'you'?  Customisation is frequently not available.  In fact, I
don't recall seeing it on offer.

> It should also be something that doesn't occur commonly. So, for all
> of those reasons, I see no particular problem with giving that a
> "generic" behavior, which could be that of deleting the entire
> combining sequence; especially if your interface normally deletes
> sequences as a unit.

> But in any case, the minimal requirement on an editor is that it lets
> you delete (and then retype) enough text to get it back to an
> uncorrupted state.

In the problem I hit, I would very nearly be left with just two options:
never having CANDRABINDU, or always having it preceded by a lone surrogate.
Whenever I enter CANDRABINDU, it is preceded by the lone surrogate.
Consequently, the option of retyping the sequence is of no avail.
Fortunately, in the application where I met the problem, the lone
surrogates, and nothing else, get deleted when the file is saved. The
problem could very easily be a lot worse.
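
As a rough illustration of that save-time behaviour, here is a minimal
sketch that assumes the text is held as a list of UTF-16 code units. The
function name strip_lone_surrogates is hypothetical and is not taken from
the application in question; the sample code point U+11100 is simply an
arbitrary supplementary-plane character.

```python
def is_high(u):
    """True if u is a high (leading) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    """True if u is a low (trailing) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def strip_lone_surrogates(units):
    """Return a copy of `units` with unpaired surrogates removed.

    Hypothetical helper: well-formed high+low pairs are kept intact,
    and every other code unit is copied through unchanged.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if is_high(u) and i + 1 < n and is_low(units[i + 1]):
            # Well-formed pair: keep both code units together.
            out.extend((u, units[i + 1]))
            i += 2
        elif is_high(u) or is_low(u):
            # Lone surrogate: drop it and nothing else.
            i += 1
        else:
            out.append(u)
            i += 1
    return out


# 'a', a lone high surrogate, U+11100 as the pair D804 DD00, then 'b'.
sample = [0x0061, 0xD800, 0xD804, 0xDD00, 0x0062]
cleaned = strip_lone_surrogates(sample)
```

Only the stray D800 is removed here; the valid pair that follows it
survives, which matches the "lone surrogates, and nothing else" behaviour
described above.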


> Catch-22 here. In filtering input to the dialog to prevent it from
> being used to corrupt text, you prevent it from being used to repair
> text. Interesting.

Not very different to having a very roll-stable aeroplane. If you ever
do end up upside-down, you have a big problem. 

