Running out of code points, redux (was: Re: Feedback on the proposal...)

Thu Jun 1 22:32:35 CDT 2017

On Thu, 1 Jun 2017 19:19:51 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> >   and therefore should start a
> > sequence of 6 characters.  
> 
> That is completely false, and has nothing to do with the current 
> definition of UTF-8.
> 
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.

TUS Chapter 3 is like the Augean Stables.  It is a complete mess as a
standards document, imputing mental states to computing processes.

Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 represents only Unicode scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because any
sequence it started would be either a 'non-shortest form' or a value
that is not a Unicode scalar value.
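
To make the derivation concrete, here is a rough Python sketch of my own,
not text from TUS.  It assumes the old RFC 2279 bit patterns (lead bytes
up to 0xFD, sequences of up to 6 bytes) and then applies just the two
rules above.  The names MIN_VALUE, sequence_length, is_scalar_value and
classify are hypothetical, chosen for illustration only.

```python
# Smallest value that genuinely needs a sequence of each length under the
# RFC 2279 bit patterns; anything lower is a 'non-shortest form'.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def sequence_length(lead):
    """Length implied by the lead byte's bit pattern (0 if it cannot lead)."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 0  # 10xxxxxx is a trail byte
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 0  # 0xFE and 0xFF never appear


def is_scalar_value(value):
    """Unicode scalar values are 0..0x10FFFF minus the surrogates."""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


def classify(seq):
    """Return 'ok' for a single well-formed sequence, else the rule it breaks."""
    seq = bytes(seq)
    if not seq:
        return "empty"
    n = sequence_length(seq[0])
    if n == 0:
        return "not a lead byte"
    if len(seq) != n:
        return "wrong length"
    # Payload bits of the lead byte: 7 for ASCII, otherwise 7 - n.
    value = seq[0] if n == 1 else seq[0] & (0xFF >> (n + 1))
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            return "bad trail byte"
        value = (value << 6) | (b & 0x3F)
    if value < MIN_VALUE[n]:
        return "non-shortest form"
    if not is_scalar_value(value):
        return "not a scalar value"
    return "ok"
```

With this, classify(b"\xED\xA0\x80") reports "not a scalar value" because
the bits decode to 0xD800, and every 6-byte sequence led by 0xFC falls
into one of the two failure cases: its value is either below 0x4000000
(non-shortest) or above 0x10FFFF (not a scalar value).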

These differences are a matter of presentation; the outcome as to what
is permitted is the same.  What changes is whether the rules are
comprehensible, and a comprehensible definition is more likely to be
implemented correctly.  Where the presentation does make a practical
difference is in how malformed sequences are naturally handled.

Richard.