Deleting Lone Surrogates
richard.wordingham at ntlworld.com
Sun Oct 4 14:30:25 CDT 2015
On Sun, 4 Oct 2015 15:44:32 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:
> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence � as just two grapheme clusters.
But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no lone
surrogates at all! (I had to look at the raw email file to be sure of
what the text was - my email client displays U+FFFD and malformed
purported UTF-8 identically.) I believe I would have a good chance of
repairing that simply by deleting the U+FFFD.
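A minimal sketch of that repair, assuming the text is already held as a Python 3 string (Python is just a convenient choice here, and the function name is made up for illustration):

```python
REPLACEMENT_CHARACTER = "\ufffd"

def delete_replacement_chars(text: str) -> str:
    """Remove every U+FFFD REPLACEMENT CHARACTER from text."""
    return text.replace(REPLACEMENT_CHARACTER, "")

# The sequence <U+1148F, U+FFFD, U+114BA> from the raw email.
damaged = "\U0001148f\ufffd\U000114ba"
repaired = delete_replacement_chars(damaged)
print([f"U+{ord(c):04X}" for c in repaired])  # ['U+1148F', 'U+114BA']
```

This works at the level of code points, so it sidesteps any question of how an editor groups the characters into clusters.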
It's not even certain that a search and replace targeting U+FFFD would
work. With a more fully supported script in LibreOffice, I would have to
switch 'CTL diacritic' matching off and hope that the substitution
replaced the shortest match. That currently works for replacing one Thai
consonant by another. To systematically replace one non-spacing Thai
character by another, I have to resort to 'regular expression'
search and replace. I must hope that they never choose to interpret
the search as matching extended grapheme clusters.
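For comparison, here is a rough sketch of the same kind of replacement outside LibreOffice, using Python's `re` module, which matches code point by code point and not by extended grapheme cluster. The choice of SARA I and SARA II as the marks and the helper name are only assumptions for the example:

```python
import re

SARA_I = "\u0e34"   # THAI CHARACTER SARA I (non-spacing vowel sign)
SARA_II = "\u0e35"  # THAI CHARACTER SARA II (non-spacing vowel sign)

def replace_mark(text: str, old: str, new: str) -> str:
    """Replace one code point by another, whatever base it follows."""
    return re.sub(re.escape(old), new, text)

word = "\u0e01\u0e34"  # KO KAI followed by SARA I
changed = replace_mark(word, SARA_I, SARA_II)
print([f"U+{ord(c):04X}" for c in changed])  # ['U+0E01', 'U+0E35']
```

The consonant is left alone and only the mark changes, which is exactly what a cluster-based search would make awkward.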
Do all Unicode character properties extend to all codepoints? If not,
how does one tell which do and which don't? If the Unicode
segmentation algorithms do apply to sequences of codepoints, as
opposed to merely to Unicode strings, then indeed <U+D805, U+114BA> is
a legacy grapheme cluster. It's an extremely unhelpful one!
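As a partial illustration only, one property that does cover surrogate code points is General_Category. The sketch below assumes Python 3 and its standard `unicodedata` module; the helper name is made up, and the answer for U+114BA depends on the Unicode version bundled with the interpreter. It says nothing about the segmentation properties themselves, which `unicodedata` does not expose.

```python
import unicodedata

def general_category(code_point: int) -> str:
    """Return the General_Category of a code point, surrogates included."""
    return unicodedata.category(chr(code_point))

# A lone high surrogate, the Tirhuta vowel sign, and U+FFFD.
for cp in (0xD805, 0x114BA, 0xFFFD):
    print(f"U+{cp:04X}", general_category(cp))
```

The lone surrogate U+D805 is reported as Cs, so at least this property is defined for it.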
> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in a different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.
I suppose the argument will go that by using rare scripts or obsolete
characters, one deserves all the problems that one gets. The only
widely used script where one is likely to encounter lone surrogates is
CJK, and they are less of a problem there. Ideally, one shouldn't get
isolated surrogates, but when one does, the mechanisms intended to
prevent them occurring can make dealing with them difficult.
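A small sketch of both halves of that problem, assuming Python 3, where a string is a sequence of code points and so any surrogate in it is necessarily a lone one. The function name and the sample string are illustrative only:

```python
import re

# In a Python 3 str, supplementary characters are single code points,
# so this class can only ever match unpaired surrogates.
LONE_SURROGATE = re.compile("[\ud800-\udfff]")

def strip_lone_surrogates(text: str) -> str:
    """Delete every surrogate code point from text."""
    return LONE_SURROGATE.sub("", text)

# <U+1148F, U+D805, U+114BA>: a lone high surrogate between two letters.
damaged = "\U0001148f\ud805\U000114ba"

try:
    damaged.encode("utf-8")
except UnicodeEncodeError:
    # The strict codec refuses the string outright.
    print("strict UTF-8 encoding rejects the lone surrogate")

cleaned = strip_lone_surrogates(damaged)
print([f"U+{ord(c):04X}" for c in cleaned])  # ['U+1148F', 'U+114BA']
```

The strict encoder is one of those preventive mechanisms: it stops the bad string being written out, but it also means the text has to be cleaned at the code point level before anything else will accept it.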