Unpaired surrogates

Tue Oct 20 05:06:35 CDT 2015

2015-10-20 2:07 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> Now, as we know, UTF-32 does not handle the full range of Unicode code
> points;

??? Every UTF handles the full range of Unicode code points that can occur
in valid text, that is, every scalar value. This includes UTF-32 as well as
UTF-16 and UTF-8 (and their variants).

> it only handles scalar values.

??? UTFs allow encoding ANY valid scalar value (the scalar values are in
bijection with a subset of the valid code points). However, they do not
allow encoding surrogates, which are valid code points but have no scalar
value, and so are not valid in any UTF.

Evidently you are still confusing code points, code units and scalar values.
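
To make the three terms concrete, here is a minimal Python sketch. The
helper names (is_scalar_value, code_units, encodable) are made up for this
illustration, and it relies on the strict behaviour of Python 3's built-in
codecs, which is an assumption about one implementation and not something
the standard mandates for every library.

```python
def is_scalar_value(cp: int) -> bool:
    """A scalar value is any code point outside the surrogate range D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def code_units(cp: int) -> dict:
    """Count the code units each encoding form needs for one scalar value."""
    ch = chr(cp)
    return {
        "utf-8": len(ch.encode("utf-8")),           # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def encodable(cp: int, encoding: str) -> bool:
    """True if the strict codec accepts this single code point."""
    try:
        chr(cp).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, encodable(0x10FFFF, enc), encodable(0xD800, enc))
    print(code_units(0x10000))
```

U+10000 is one code point and one scalar value, but one, two or four code
units depending on the encoding form. U+D800 is a code point, is not a
scalar value, and is refused by all three strict encoders.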

> In the discussion of UTS#18
> RL1.7, my objections did result in the addition of:
>
> "Note: It is permissible, but not required, to match an isolated
> surrogate code point (such as \u{D800}), which may occur in Unicode
> Strings. See Unicode String in the Unicode glossary."
>
> I'm not sure that that text loosely associated with RL1.7 gets round
> Requirement RL1.1, which still reads:
>
> "To meet this requirement, an implementation shall supply a mechanism
> for specifying any Unicode code point (from U+0000 to U+10FFFF), using
> the hexadecimal code point representation."
>

I'm also puzzled about how such a regexp can really match any input text
if that input text must be in a valid UTF. The regexp "\u{D800}" will
likely match only lone surrogates (in any UTF), not a surrogate with the
same value that is correctly paired to encode a supplementary code point.
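
As an illustration of that matching behaviour, here is a small sketch using
Python's re module on str values, which are sequences of code points. It
shows how one engine behaves and is not a claim about what UTS#18 requires;
the names LONE_SURROGATE and has_lone_surrogate are invented for the
example.

```python
import re

# Any single surrogate code point, U+D800..U+DFFF, in a Python str.
LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def has_lone_surrogate(text: str) -> bool:
    """True if the string contains a surrogate code point."""
    return LONE_SURROGATE.search(text) is not None


if __name__ == "__main__":
    # U+10000 is a single code point here, not a D800 DC00 pair.
    print(has_lone_surrogate("\U00010000"))
    # An isolated D800 between two letters is matched.
    print(has_lone_surrogate("a\ud800b"))
```

Because the string holds code points and not UTF-16 code units, a
supplementary character never exposes its surrogates to the pattern; only
an isolated surrogate in the string can match.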

Note that U+D800 cannot occur in **valid** UTF-8 text. If you drop the
"valid" restriction, U+D800 may be present, even immediately before U+DC00,
but the two do not form a valid pair: they are still lone surrogates in
that case (they are paired, and encode a supplementary code point, only if
the text uses UTF-16).

Surrogate pairs are never valid in UTF-8 or UTF-32, so if surrogates appear
there, they are all "lone" surrogates. If you blindly convert such text
from UTF-8 or UTF-32 to UTF-16, the invalid text could become valid and new
supplementary code points will appear unexpectedly. That's why lone
surrogates cannot be part of any valid UTF: they break the bijection.
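
A short Python sketch of that failure mode follows. It assumes Python 3's
codecs and uses the "surrogatepass" error handler to stand in for a
converter that does no validity check; the function names are hypothetical.

```python
def blind_utf16_roundtrip(text: str) -> str:
    """Write code points to UTF-16 without checking, then decode strictly."""
    # "surrogatepass" lets surrogate code points through unchecked.
    raw = text.encode("utf-16-le", "surrogatepass")
    # The strict decoder now sees an ordinary, valid surrogate pair.
    return raw.decode("utf-16-le")


def strict_utf8_accepts(raw: bytes) -> bool:
    """True if the bytes are valid UTF-8 according to the strict decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    two_lone = "\ud800\udc00"  # two code points, both lone surrogates
    out = blind_utf16_roundtrip(two_lone)
    print(len(two_lone), len(out), hex(ord(out)))
    # The same two surrogates written as 3-byte sequences are not valid UTF-8.
    print(strict_utf8_accepts(two_lone.encode("utf-8", "surrogatepass")))
```

Two lone surrogates go in and a single U+10000 comes out, so the
conversion is not invertible. A strict UTF-8 decoder avoids the problem by
rejecting the input before any conversion takes place.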