Deleting Lone Surrogates

Richard Wordingham richard.wordingham at ntlworld.com
Sun Oct 4 14:38:02 CDT 2015


On Sun, 4 Oct 2015 10:50:43 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of-range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete
just a non-initial character from a grapheme cluster.  I fear there may
be editors that don't even allow one to delete the final character.
This may not be a problem when one works with a small set of grapheme
clusters, as in French or German, or possibly even Vietnamese, but it
becomes a problem when the set is so large that the notion that its
members are user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get
rid of.  However, even if the editor does not allow the entry of
arbitrary scalar values, a user might fix the problem by creating an
HTML file containing the character and then copying the character from
the HTML file into a find-and-replace command.  This trick is unlikely
to work for a lone surrogate, which is not a scalar value and so cannot
be written as a character in a well-formed HTML file.
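
As a rough illustration of the difference, here is a minimal Python
sketch.  It assumes Python's standard html.unescape as a stand-in for
how an HTML consumer resolves numeric character references; the helper
drop_lone_surrogates is a hypothetical name of my own, not something
any editor is known to provide, and it assumes the text is available as
a sequence of UTF-16 code units.

```python
import html

# U+50005 is an (unassigned) scalar value, so a numeric character
# reference can carry it through an HTML file.
stray = html.unescape("&#x50005;")

# A surrogate code point is not a scalar value; html.unescape
# substitutes U+FFFD, so the lone surrogate cannot be obtained this way.
surrogate_attempt = html.unescape("&#xD800;")


def drop_lone_surrogates(units):
    """Return a copy of a UTF-16 code unit list without unpaired surrogates.

    Hypothetical helper: well-formed high+low pairs are kept intact.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: keep it only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.extend(units[i:i + 2])
                i += 2
            else:
                i += 1  # unpaired high surrogate: drop it
        elif 0xDC00 <= u <= 0xDFFF:
            i += 1  # low surrogate with no preceding high: drop it
        else:
            out.append(u)
            i += 1
    return out


# 'a', lone high surrogate, U+0301, a valid pair, lone low surrogate, 'b'
sample = [0x0061, 0xD800, 0x0301, 0xD83D, 0xDE00, 0xDC00, 0x0062]
cleaned = drop_lone_surrogates(sample)
```

A filter like this works on code units rather than grapheme clusters,
which is the level at which a lone surrogate has to be addressed when
the editor's own deletion commands will not reach it.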

Richard.

