Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

Fri May 8 23:13:33 CDT 2015

2015-05-09 5:13 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> I can't think of a practical use for the specific concepts of Unicode
> 8-bit, 16-bit and 32-bit strings.  Unicode 16-bit strings are
> essentially the same as 16-bit strings, and Unicode 32-bit strings are
> UTF-32 strings.   'Unicode 8-bit string' strikes me as an exercise in
> pedantry; there are more useful categories of 8-bit strings that are
> not UTF-8 strings.
>

And here you're wrong: a 16-bit string is just a sequence of arbitrary
16-bit code units, but an Unicode string (whatever the size of its code
units) adds restrictions for validity (the only restriction being in fact
that surrogates (when present in 16-bit strings, i.e. UTF-16) must be
paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
forbidden.

So the concept of "Unicode string" is in fact the same as valid Unicode
text: it is a subset of possible strings, restricted by validation rules:
- for 8-bit strings (UTF-8) there are other constraints (not all bytes are
acceptable and some pairs of bytes are also restricted, and final bytes
cannot occur alone)
- for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired
surrogates
- for 32-bit strings (UTF-32), the only constaint is on the two allowed
ranges of encoded code points (U+0000..U+D7FF and U+E000..U+10FFFF).

For being "plain-text" there are additional restrictions: non-characters
are also excluded, and only a small subset of controls (basically tabs and
newlines) is allowed (the other controls, including U+0000 are restricted
for private protocols and not designed for plain text... except
specifically in a few legacy encoded 8-bit "charsets" like VISCII or ISO
2022 or Videotext which need these controls in fact to represent characters
into sequences, possibly with contextual encoding).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20150509/2862bab6/attachment.html>