Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
markus.icu at gmail.com
Fri May 8 23:37:40 CDT 2015
On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-05-09 5:13 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>> I can't think of a practical use for the specific concepts of Unicode
>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
>> essentially the same as 16-bit strings, and Unicode 32-bit strings are
>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
>> pedantry; there are more useful categories of 8-bit strings that are
>> not UTF-8 strings.
> And here you're wrong: a 16-bit string is just a sequence of arbitrary
> 16-bit code units, but a Unicode string (whatever the size of its code
> units) adds restrictions for validity (the only restriction being in fact
> that surrogates (when present in 16-bit strings, i.e. UTF-16) must be
> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are
No, Richard had it right. See for example definition D82 "Unicode 16-bit
string" in the standard. (Section 3.9 Unicode Encoding Forms,
I agree that the definitions for Unicode 8-bit and 32-bit strings are not
> For being "plain-text" there are additional restrictions: non-characters
> are also excluded, and only a small subset of controls (basically tabs and
> newlines) is allowed (the other controls, including U+0000 are restricted
> for private protocols and not designed for plain text... except
> specifically in a few legacy encoded 8-bit "charsets" like VISCII or ISO
> 2022 or Videotext which need these controls in fact to represent characters
> into sequences, possibly with contextual encoding).
Where did you find that definition of "plain text"?
Unicode just defines "plain text" by contrast with "rich text" which is
text with markup or other such structure. There is no limitation of code
points associated with that term.
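
To make the distinction concrete, here is a minimal Python sketch (not taken from the standard or from any library; the function names are hypothetical and chosen only for this illustration). It assumes a 16-bit string is given as a list of integers in the range 0..0xFFFF. Any such list is a "Unicode 16-bit string" in the sense of D82; it is well-formed UTF-16 only if every surrogate code unit is paired. Noncharacters are a separate matter: they are ordinary code points as far as the encoding forms are concerned, so a string containing them is still well-formed.

```python
def is_well_formed_utf16(units):
    """Return True if `units` (a list of 16-bit code units) is well-formed UTF-16.

    Any list of 16-bit code units is a Unicode 16-bit string; the extra
    requirement checked here is only that surrogates occur in lead/trail pairs.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: must be immediately followed by a trail surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead surrogate.
            return False
        i += 1
    return True


def is_noncharacter(cp):
    """Return True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in each plane."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a code point: %r" % (cp,))
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# A lone lead surrogate: still a Unicode 16-bit string, but not UTF-16.
lone_surrogate = [0x0041, 0xD800]
# Noncharacters only: well-formed UTF-16 all the same.
nonchars = [0xFDD0, 0xFFFE]

print(is_well_formed_utf16(lone_surrogate))
print(is_well_formed_utf16(nonchars))
print(all(is_noncharacter(cp) for cp in nonchars))
```

The two checks are deliberately independent: well-formedness is a property of the code unit sequence, while being a noncharacter is a property of an individual code point and does not affect whether the sequence is valid UTF-16.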