Surrogates and noncharacters

Richard Wordingham richard.wordingham at ntlworld.com
Sun May 10 05:23:41 CDT 2015


On Sun, 10 May 2015 07:42:14 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

I am replying out of order for greater coherence of my reply.

> However I wonder what would be the effect of D80 in UTF-32: is
> <0xFFFFFFFF> a valid "32-bit string" ? After all it is also
> containing a single 32-bit code unit (for at least one Unicode
> encoding form), even if it has no "scalar value" and then does not
> have to validate D89 (for UTF-32)...

The value 0xFFFFFFFF cannot appear in a UTF-32 string.  Therefore it
cannot represent a unit of encoded text in a UTF-32 string.  By D77
paragraph 1, "Code unit:  The minimal bit combination that can
represent a unit of encoded text for processing or interchange", it is
therefore not a code unit.  The effect of D77, D80 and D83 is that
<0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
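For concreteness, here is a quick sketch of that distinction (my own illustration in Python, not anything taken from the standard): the single unit 0xFFFFFFFF is a perfectly well-formed 32-bit string, but a conformant UTF-32 decoder must reject it because 0xFFFFFFFF is not a Unicode scalar value.

```python
import struct

# A "32-bit string" in the generic sense: one 32-bit unit, 0xFFFFFFFF,
# serialized here as little-endian code units.
data = struct.pack('<I', 0xFFFFFFFF)

# A conformant UTF-32 decoder rejects it: 0xFFFFFFFF has no scalar
# value, so <0xFFFFFFFF> is not a Unicode 32-bit string.
try:
    data.decode('utf-32-le')
    print('decoded')
except UnicodeDecodeError:
    print('rejected by the UTF-32 decoder')
```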

> - D80 defines "Unicode string" but in fact it just defines a generic
> "string" as an arbitrary stream of fixed-size code units.

No - see argument above.

> These two rules [D80 and D82 - RW] are not productive at all, except
> for saying that all values of fixed size code units are acceptable
> (including for example 0xFF in 8-bit strings, which is invalid in
> UTF-8)

Do you still maintain this reading of D77?  D77 is not as clear as it
should be.
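The 0xFF case you mention can be demonstrated the same way (again my own sketch, using Python's strict UTF-8 decoder): <0xFF> is a valid 8-bit string, yet 0xFF can never appear in well-formed UTF-8.

```python
# An arbitrary 8-bit string: the single code unit 0xFF.
s = bytes([0xFF])

# 0xFF never occurs in well-formed UTF-8, so this 8-bit string
# is not a valid UTF-8 encoding of anything.
try:
    s.decode('utf-8')
    print('decoded')
except UnicodeDecodeError:
    print('0xFF is not valid in UTF-8')
```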

> <snip> D80 and D82 have no purpose, except adding the term "Unicode"
> redundantly to these expressions.

I have the cynical suspicion that these definitions were added to
preserve the interface definitions of routines processing UCS-2
strings when the transition to UTF-16 occurred.  They can also have the
(intentional?) side-effect of making more work for UTF-8 and UTF-32
processing, because arbitrary 8-bit strings and 32-bit strings are not
Unicode strings.

Richard.
