Deleting Lone Surrogates
richard.wordingham at ntlworld.com
Mon Oct 5 14:58:48 CDT 2015
On Mon, 5 Oct 2015 16:51:25 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-10-05 13:50 GMT+02:00 Martin J. Dürst <duerst at it.aoyama.ac.jp>:
> > In an editing tool (of which an editing interface is a part), a
> > lone surrogate should just be removed! Apparently, that's what
> > happens in Richard's case, but only eventually.
> Not silently! Even if this removal is required to go on editing,
> the user must be notified of it, as it may occur in unedited parts
> of the file (and it may be a sign that the document is not fully
> plain text, so the user should not save the edited file).
> If this is caused by a quirk in the user input (defect of the input
> mode or keyboard layout), there should be a notification.
The lone surrogates in this case are (as I surmise) caused by the user
input being misinterpreted. A program running under X that receives
the same sequence of keystrokes is delivered the strings U+1148F,
U+114C0, U+0008, U+114BF, and I have no reason to doubt that the
offending program is receiving the same sequence. My working
hypothesis is that this is being simplified to U+1148F, U+D805,
U+114BF, i.e. that the backspace (U+0008) removes only the trailing
surrogate of U+114C0, which is <D805 DCC0> in UTF-16. The presence of
U+D805 is a program error. I can reproduce the problem in a previously
empty file.
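
To make the hypothesis concrete, here is a minimal sketch in Python of
how a backspace that deletes one UTF-16 code unit, rather than one
character, would leave U+D805 behind. The function names and the
`surrogate_aware` flag are made up for illustration; this is a model of
the suspected behaviour, not the code of the offending program.

```python
def to_utf16(cp):
    """Encode one Unicode scalar value as a list of UTF-16 code units."""
    if cp >= 0x10000:
        cp -= 0x10000
        return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
    return [cp]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def apply_input(events, surrogate_aware):
    """Apply input events to an empty buffer of UTF-16 code units.

    U+0008 is treated as backspace. With surrogate_aware=False the
    backspace deletes a single code unit (the suspected bug); with
    surrogate_aware=True it deletes a whole surrogate pair.
    """
    buffer = []
    for cp in events:
        if cp == 0x0008:
            if not buffer:
                continue
            last = buffer.pop()
            if surrogate_aware and is_low(last) and buffer and is_high(buffer[-1]):
                buffer.pop()
        else:
            buffer.extend(to_utf16(cp))
    return buffer


def lone_surrogates(units):
    """Return the indices of surrogates that are not part of a pair."""
    lone = []
    i = 0
    while i < len(units):
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2  # well-formed pair
            continue
        if is_high(units[i]) or is_low(units[i]):
            lone.append(i)
        i += 1
    return lone


# The sequence observed under X: U+1148F, U+114C0, U+0008, U+114BF.
EVENTS = [0x1148F, 0x114C0, 0x0008, 0x114BF]
naive = apply_input(EVENTS, surrogate_aware=False)
careful = apply_input(EVENTS, surrogate_aware=True)
```

With the code-unit backspace, `naive` is <D805 DC8F D805 D805 DCBF>,
i.e. U+1148F, a lone U+D805, U+114BF, and `lone_surrogates(naive)`
reports index 2. With the pair-aware backspace the buffer is simply
U+1148F, U+114BF.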
Now, on Windows, old MS keyboards at least deliver supplementary
characters in a pair of WM_CHAR messages. If one of these ligatures
were corrupted so that only the first of the messages was delivered, it
is not obvious to me how a program would readily detect the omission.
It would only become obvious when the start of the next *character* was
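
As a sketch of the detection problem with paired messages, the Python
below models a receiver that is handed UTF-16 code units one at a time,
as with a pair of WM_CHAR messages. The class `SurrogatePairer` and its
event tuples are hypothetical, not any Windows API. A high surrogate has
to be held back until the next code unit arrives, so a missing second
half can only be reported at that point.

```python
class SurrogatePairer:
    """Assemble characters from UTF-16 code units delivered one by one."""

    def __init__(self):
        self.pending_high = None  # high surrogate awaiting its partner

    def feed(self, unit):
        """Accept one code unit; return a list of ("char" | "lone", value)."""
        out = []
        if self.pending_high is not None:
            high, self.pending_high = self.pending_high, None
            if 0xDC00 <= unit <= 0xDFFF:
                cp = 0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00)
                return [("char", cp)]
            # The partner never came; this is only known now.
            out.append(("lone", high))
        if 0xD800 <= unit <= 0xDBFF:
            self.pending_high = unit  # cannot judge it yet
        elif 0xDC00 <= unit <= 0xDFFF:
            out.append(("lone", unit))
        else:
            out.append(("char", unit))
        return out


# U+1148F intact, U+114C0 with its second code unit lost, then U+114BF.
pairer = SurrogatePairer()
results = [pairer.feed(u) for u in [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBF]]
```

Here the truncated pair yields nothing when its first code unit
arrives; the `("lone", 0xD805)` event appears only on the fourth call,
when the first code unit of U+114BF is fed in.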