Deleting Lone Surrogates

Mark Davis ☕️ mark at macchiato.com
Sun Oct 4 08:44:32 CDT 2015


When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the
sequence <U+1148F, U+D805, U+114BA> as just two grapheme clusters.

In UAX #29 we are specifically not concerned about ill-formed text (or other
degenerate cases). I suppose it would be possible to handle isolated
surrogates in a different way (e.g. always breaking) if they represented a
common problem, but someone would have to make a very good case for that.


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke.  I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805.  However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.  At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>
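
For readers who want to check the arithmetic in the quoted report, here is a minimal Python sketch of how the UTF-16 code unit sequence <D805 DC8F D805 D805 DCBA> comes to be read as <U+1148F, U+D805, U+114BA>. The function names (decode_utf16_units, is_lone_surrogate, drop_lone_surrogates) are hypothetical and written only for illustration. They are not taken from any application discussed in this thread, and passing an unpaired surrogate through as its own code point is just one possible policy for ill-formed input.

```python
def decode_utf16_units(units):
    """Turn a list of UTF-16 code units into a list of code points.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point. Any other surrogate is passed through
    unchanged as a lone surrogate (an assumption of this sketch).
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


def is_lone_surrogate(code_point):
    """After decoding, any value in D800..DFFF is an unpaired surrogate."""
    return 0xD800 <= code_point <= 0xDFFF


def drop_lone_surrogates(code_points):
    """Remove unpaired surrogates, keeping every other code point."""
    return [cp for cp in code_points if not is_lone_surrogate(cp)]


# The code unit sequence from the report quoted above.
units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
decoded = decode_utf16_units(units)
print(["U+%04X" % cp for cp in decoded])
print(["U+%04X" % cp for cp in drop_lone_surrogates(decoded)])
```

Dropping the lone surrogate leaves the originally intended pair <U+1148F, U+114BA>. Whether an editor should do that silently, or expose the stray code unit to the user for deletion, is the question raised in the quoted message and is not settled by this sketch.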