Surrogates and noncharacters
Richard Wordingham
richard.wordingham at ntlworld.com
Sun May 10 05:23:41 CDT 2015
On Sun, 10 May 2015 07:42:14 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
I am replying out of order for greater coherence of my reply.
> However I wonder what would be the effect of D80 in UTF-32: is
> <0xFFFFFFFF> a valid "32-bit string" ? After all it is also
> containing a single 32-bit code unit (for at least one Unicode
> encoding form), even if it has no "scalar value" and then does not
> have to validate D89 (for UTF-32)...
The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it
cannot represent a unit of encoded text in a UTF-32 string. By D77
paragraph 1, "Code unit: The minimal bit combination that can
represent a unit of encoded text for processing or interchange", it is
therefore not a code unit. The effect of D77, D80 and D83 is that
<0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
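As a rough sketch of that argument, the check below treats a value as a UTF-32 code unit only if it can occur in well-formed UTF-32, i.e. only if it is a Unicode scalar value. The function names are made up for illustration, and the exclusion of surrogate values follows from this reading of D77 rather than from explicit wording in D83.

```python
def is_utf32_code_unit(value: int) -> bool:
    """True if value can occur in a well-formed UTF-32 string.

    Assumption: under the reading of D77 argued above, that means
    the value is a Unicode scalar value (0..0x10FFFF, no surrogates).
    """
    return 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def is_unicode_32bit_string(units) -> bool:
    """True if every 32-bit unit is a UTF-32 code unit (D83, as read above)."""
    return all(is_utf32_code_unit(u) for u in units)


# <0xFFFFFFFF> is a 32-bit string, but not a Unicode 32-bit string.
print(is_unicode_32bit_string([0xFFFFFFFF]))
# Noncharacters such as U+FFFF are still scalar values, so they pass.
print(is_unicode_32bit_string([0x41, 0xFFFF, 0x10FFFF]))
```

Under this reading a Unicode 32-bit string is the same thing as a well-formed UTF-32 string, because each UTF-32 code unit is a complete code unit sequence by itself.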
> - D80 defines "Unicode string" but in fact it just defines a generic
> "string" as an arbitrary stream of fixed-size code units.
No - see argument above.
> These two rules [D80 and D82 - RW] are not productive at all, except
> for saying that all values of fixed size code units are acceptable
> (including for example 0xFF in 8-bit strings, which is invalid in
> UTF-8)
Do you still maintain this reading of D77? D77 is not as clear as it
should be.
> <snip> D80 and D82 have no purpose, except adding the term "Unicode"
> redundantly to these expressions.
I have the cynical suspicion that these definitions were added to
preserve the interface definitions of routines processing UCS-2
strings when the transition to UTF-16 occurred. They can also have the
(intentional?) side-effect of making more work for UTF-8 and UTF-32
processing, because arbitrary 8-bit strings and 32-bit strings are not
Unicode strings.
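To make the asymmetry concrete, here is a minimal sketch of the same test for 8-bit and 16-bit strings, again assuming the reading of D77 in which a code unit is a value that can occur somewhere in well-formed text of that encoding form. The function names are hypothetical. The byte ranges are those that appear in the table of well-formed UTF-8 byte sequences: 0xC0, 0xC1 and 0xF5..0xFF never occur.

```python
def is_utf8_code_unit(byte: int) -> bool:
    """True if the byte can occur somewhere in well-formed UTF-8.

    0x00..0x7F are ASCII, 0x80..0xBF are trailing bytes,
    0xC2..0xF4 are lead bytes; 0xC0, 0xC1, 0xF5..0xFF never occur.
    """
    return 0x00 <= byte <= 0xBF or 0xC2 <= byte <= 0xF4


def is_utf16_code_unit(unit: int) -> bool:
    """Every 16-bit value can occur in well-formed UTF-16."""
    return 0 <= unit <= 0xFFFF


def is_unicode_8bit_string(units) -> bool:
    return all(is_utf8_code_unit(b) for b in units)


def is_unicode_16bit_string(units) -> bool:
    return all(is_utf16_code_unit(u) for u in units)


# 0xFF is never a UTF-8 code unit, so <0xFF> is not a Unicode 8-bit string.
print(is_unicode_8bit_string(b"\xff"))
# A lone trailing byte is a Unicode 8-bit string, though ill-formed UTF-8.
print(is_unicode_8bit_string(b"\x80"))
# A lone surrogate is a Unicode 16-bit string, though ill-formed UTF-16.
print(is_unicode_16bit_string([0xD800]))
```

So a routine accepting arbitrary 16-bit strings needs no check at all to claim it takes Unicode 16-bit strings, whereas the 8-bit and 32-bit cases need a per-unit test.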
Richard.