Surrogates and noncharacters

Richard Wordingham richard.wordingham at ntlworld.com
Sun May 10 15:44:29 CDT 2015


On Sun, 10 May 2015 21:19:52 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> The way I read D77 (code unit), it is not bound to any Unicode encoding
> form;

Agreed.

> "The minimal bit combination that can represent a unit of
> encoded text for processing or interchange" can be any bit length and
> can even use a non-binary representation (not bit-based; it could be
> ternary, or floating point, or base ten, with the remaining bit
> patterns possibly used for other functions (such as clock
> synchronization/calibration, polarization balancing), leaving only
> some patterns distinguishable, but not necessarily an exact power of
> two...)

I don't object to that reading, but I'm not sure it's correct.

> I don't see why a 32-bit code unit or 8-bit code unit has to
> be bound to UTF-32 or UTF-8 in D77; the code unit is just a code
> unit; it does not have to be assigned any Unicode scalar value or
> exist in a specific pattern valid for UTF-32 or UTF-8 (in addition
> these two UTF's are not the only two ones supported; look at SCSU, for
> example, or GB18030, which are also conforming UTF's):

D77 is definitely not bound to Unicode encoding forms - it gives
Shift-JIS as an example of an encoding that has code units.

> The code unit is just one element within an enumerable and finite set
> of elements that is transmissible to some interface and
> interchangeable.
> 
> It's up to each UTF to define how it can use them: these UTF's are
> usable on these sets provided that these sets are large enough to
> contain at least the number of code units required for this UTF to
> be supported (which means that the actual bit count of the transported
> code units does not matter; this is out of scope of TUS, which just
> requires sets with sufficient cardinality):

The critical matter is the number of array elements needed for each
scalar value and the pattern of which elements of the scalar values
have the 'same' values.

> For these reasons I absolutely do not see why you argue that 0xFFFFFFFF
> cannot be a valid 32-bit code unit

Fair point so far.  I agree it can be a 32-bit code unit in some
character encoding.  However, it is not a UTF-32 code unit.
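A quick way to see the distinction is to hand the value to a UTF-32
decoder. The sketch below uses Python's built-in "utf-32-le" codec as a
stand-in for a conforming UTF-32 process; the helper name
decodes_as_utf32 is made up for this illustration, and the choice of
little-endian byte order is an arbitrary assumption.

```python
import struct


def decodes_as_utf32(unit: int) -> bool:
    """Return True if a single 32-bit unit decodes as well-formed UTF-32.

    Hypothetical helper: it serialises the unit as four little-endian
    bytes and asks Python's utf-32-le codec to decode them.
    """
    raw = struct.pack("<I", unit)  # any value 0..0xFFFFFFFF is a 32-bit unit
    try:
        raw.decode("utf-32-le")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for value in (0x0041, 0x10FFFF, 0x110000, 0xFFFFFFFF):
        print(f"0x{value:08X}", decodes_as_utf32(value))
```

Every value in the loop fits in 32 bits, so each is a perfectly good
32-bit unit; only those up to 0x10FFFF are accepted by the decoder.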

> and then why <0xFFFFFFFF> can't be a
> valid 32-bit string

I agree that it is a 32-bit string.  I don't know what you mean by the
word 'valid' in this context.

> (or "Unicode 32-bit string", as TUS renames it in
> D80-D83 in a way that is really unproductive and in fact confusing).

I hope you now see that it cannot be a Unicode 32-bit string, for
0xFFFFFFFF is not a UTF-32 code unit.  This is a key point in the
difference between:

a) x-bit string,
b) Unicode x-bit string, and
c) UTF-x string

For x=8, these are three different things.  For x=16 or x=32, these are
two different things, but they do not split the same way.
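The three categories can be sketched as predicates in Python. This is
only an illustration of the reading argued in this message, not
normative text: the function names are invented, and the sets of UTF-8
and UTF-32 code units (bytes 0xC0, 0xC1 and 0xF5..0xFF excluded;
surrogates and values above 0x10FFFF excluded) are my assumptions about
which values can occur in well-formed text.

```python
def is_x_bit_string(units, x):
    """a) x-bit string: every element fits in x bits."""
    return all(0 <= u < (1 << x) for u in units)


def _is_utf_code_unit(u, x):
    # Assumed reading: a UTF-x code unit is a value that can occur
    # somewhere in at least one well-formed UTF-x string.
    if x == 8:
        return 0x00 <= u <= 0xBF or 0xC2 <= u <= 0xF4
    if x == 16:
        return 0x0000 <= u <= 0xFFFF
    if x == 32:
        return 0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF)
    raise ValueError("x must be 8, 16 or 32")


def is_unicode_x_bit_string(units, x):
    """b) Unicode x-bit string: contains only UTF-x code units."""
    return all(_is_utf_code_unit(u, x) for u in units)


def is_utf_x_string(units, x):
    """c) UTF-x string: a well-formed UTF-x code unit sequence."""
    if not is_unicode_x_bit_string(units, x):
        return False
    if x == 8:
        try:
            bytes(units).decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            return False
        return True
    if x == 16:
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one next
                if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    i += 2
                    continue
                return False
            if 0xDC00 <= u <= 0xDFFF:  # unpaired low surrogate
                return False
            i += 1
        return True
    return True  # x == 32: one code unit per scalar value


if __name__ == "__main__":
    samples = [([0xFF], 8), ([0xC2], 8), ([0xD800], 16), ([0xFFFFFFFF], 32)]
    for units, x in samples:
        print(x, [hex(u) for u in units],
              is_x_bit_string(units, x),
              is_unicode_x_bit_string(units, x),
              is_utf_x_string(units, x))
```

Under these assumptions, <0xC2> separates (b) from (c) for x=8,
<0xD800> separates them for x=16 while (a) and (b) coincide, and
<0xFFFFFFFF> separates (a) from (b) for x=32 while (b) and (c)
coincide.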

D80-D83 do not directly rename 8-bit strings, 16-bit strings or 32-bit
strings as Unicode 8-bit strings, Unicode 16-bit strings or Unicode
32-bit strings.  That all 16-bit strings are Unicode 16-bit strings
is a consequence of the definition of UTF-16.  Similarly, that not all
8-bit strings are Unicode 8-bit strings and that not all 32-bit strings
are Unicode 32-bit strings are consequences of the definitions of UTF-8
and UTF-32 respectively.

I agree that the concept of Unicode 8-bit strings is not useful.  The
separate concept of Unicode 32-bit strings is also not useful, for I
contend that all Unicode 32-bit strings are in fact UTF-32 strings.
The latter result is an immediate consequence of UTF-32 not being a
multi-code unit encoding.

> Likewise, nothing prohibits supporting the UTF-32 encoding form over a
> 21-bit stream, using another "encoding scheme" (which cannot also be
> named "UTF-32", "UTF-32BE" or "UTF-32LE", but could be named
> "UTF-32-21"): the result will be a 21-bit string, but the 21-bit code
> unit 0x1FFFFF will still be valid.
> 
> 2015-05-10 12:23 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > On Sun, 10 May 2015 07:42:14 +0200
> > Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> >
> > I am replying out of order for greater coherence of my reply.
> >
> > > However I wonder what would be the effect of D80 in UTF-32: is
> > > <0xFFFFFFFF> a valid "32-bit string" ? After all it is also
> > > containing a single 32-bit code unit (for at least one Unicode
> > > encoding form), even if it has no "scalar value" and then does not
> > > have to validate D89 (for UTF-32)...
> >
> > The value 0xFFFFFFFF cannot appear in a UTF-32 string.  Therefore it
> > cannot represent a unit of encoded text in a UTF-32 string.  By D77
> > paragraph 1, "Code unit:  The minimal bit combination that can
> > represent a unit of encoded text for processing or interchange", it
> > is therefore not a code unit.

Correction: "is therefore not a UTF-32 code unit."

> >  The effect of D77, D80 and D83 is
> > that <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit
> > string.
> >
> > > - D80 defines "Unicode string" but in fact it just defines a
> > > generic "string" as an arbitrary stream of fixed-size code units.
> >
> > No - see argument above.
> >
> > > These two rules [D80 and D82 - RW] are not productive at all,
> > > except for saying that all values of fixed size code units are
> > > acceptable (including for example 0xFF in 8-bit strings, which is
> > > invalid in UTF-8)

I ask again:
Do you still maintain this reading of D77?  D77 is not as clear as it
should be.

Richard.

