Deleting Lone Surrogates

Richard Wordingham richard.wordingham at ntlworld.com
Sun Oct 4 08:02:01 CDT 2015


In the absence of a specific tailoring, is the combination of a lone
surrogate and a combining mark a user-perceived character?  Does a lone
surrogate constitute a user-perceived character?

The problem I have is that, because of an application-specific bug,
when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
the code point sequence <U+1148F, U+D805, U+114BA>.
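
To make that interpretation concrete, here is a minimal sketch of a
lenient UTF-16 decoder that passes an unpaired surrogate through as its
own code point. The function name decode_utf16_units is hypothetical,
and I am only assuming that the application decodes in roughly this way.

```python
def decode_utf16_units(units):
    """Decode UTF-16 code units to code points, keeping lone surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: combine into one supplementary code point.
            high_bits = (unit - 0xD800) << 10
            low_bits = units[i + 1] - 0xDC00
            code_points.append(0x10000 + high_bits + low_bits)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: pass it through unchanged.
            code_points.append(unit)
            i += 1
    return code_points


units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
print([f"U+{cp:04X}" for cp in decode_utf16_units(units)])
# prints ['U+1148F', 'U+D805', 'U+114BA']
```

The middle D805 is followed by another high surrogate rather than a low
one, so it cannot pair and survives as the lone U+D805.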

(The problem seems to arise because I use a sequence of two keystrokes
to enter candrabindu, and the application or input mechanism has to undo
the entry of a supplementary character entered in response to the first
keystroke.  I've reported the problem as Bug 94753.)

Because the lone surrogate is interpreted as the start of a
user-perceived character, I can move the cursor to between U+1148F and
U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
key) will delete the U+D805.  However, if the lone surrogate plus
combining mark is a user-perceived character, then all I will be left
with is <U+1148F>.  At present the offending application is treating
Tirhuta combining marks as user-perceived characters, but I suspect the
application has simply not caught up with Unicode Version 7 yet.
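
The two outcomes of pressing 'delete' can be sketched as follows. This
is only an illustration, not an implementation of the default grapheme
cluster rules: the set EXTENDING and the functions cluster_end and
forward_delete are hypothetical names, and the flag marks_extend stands
for whichever answer to my question turns out to be right.

```python
# Assumption for this sketch: the mark that extends a preceding character.
EXTENDING = {0x114BA}


def cluster_end(code_points, start, marks_extend):
    """Return the index just past the user-perceived character at start."""
    end = start + 1
    if marks_extend:
        # Absorb any following marks into the same user-perceived character.
        while end < len(code_points) and code_points[end] in EXTENDING:
            end += 1
    return end


def forward_delete(code_points, cursor, marks_extend):
    """Delete the user-perceived character that starts at the cursor."""
    end = cluster_end(code_points, cursor, marks_extend)
    return code_points[:cursor] + code_points[end:]


text = [0x1148F, 0xD805, 0x114BA]

# Lone surrogate plus mark counts as one user-perceived character.
print([f"U+{cp:04X}" for cp in forward_delete(text, 1, True)])
# prints ['U+1148F']

# Each mark is its own user-perceived character (the current behaviour).
print([f"U+{cp:04X}" for cp in forward_delete(text, 1, False)])
# prints ['U+1148F', 'U+114BA']
```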

Richard.

