Running out of code points, redux (was: Re: Feedback on the proposal...)
Richard Wordingham via Unicode
unicode at unicode.org
Thu Jun 1 22:32:35 CDT 2017
On Thu, 1 Jun 2017 19:19:51 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:
> > and therefore should start a
> > sequence of 6 characters.
>
> That is completely false, and has nothing to do with the current
> definition of UTF-8.
>
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.
TUS Chapter 3 is like the Augean Stables. It is a complete mess as a
standards document, imputing mental states to computing processes.
Table 3-7, for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.
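
To make the derivation concrete, here is a minimal Python sketch of a
validator built from the two rules alone. It is an illustration under
stated assumptions, not text from any standard: the generic bit pattern
is the one from the obsolete RFC 2279 (lead bytes up to 0xFD, sequences
up to 6 bytes), and the names MIN_VALUE, generic_decode, is_scalar_value
and is_well_formed are hypothetical.

```python
# Smallest value that genuinely needs a sequence of each length.
# Anything smaller encoded at that length is a 'non-shortest form'.
MIN_VALUE = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def generic_decode(seq):
    """Apply the generic lead/continuation bit pattern (assumed: RFC 2279 style).

    Returns the decoded integer, or None if the byte pattern itself is
    broken (bad lead byte, wrong length, or a non-continuation byte).
    """
    if not seq:
        return None
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead <= 0xFB:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead <= 0xFD:
        length, value = 6, lead & 0x01
    else:
        return None  # 0x80..0xBF (continuation), 0xFE, 0xFF
    if len(seq) != length:
        return None
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            return None
        value = (value << 6) | (b & 0x3F)
    return value


def is_scalar_value(v):
    """Unicode scalar values: 0..0x10FFFF excluding surrogates 0xD800..0xDFFF."""
    return 0 <= v <= 0x10FFFF and not (0xD800 <= v <= 0xDFFF)


def is_well_formed(seq):
    """True if seq is exactly one well-formed UTF-8 sequence."""
    value = generic_decode(seq)
    if value is None:
        return False
    if value < MIN_VALUE[len(seq)]:
        return False  # non-shortest form
    return is_scalar_value(value)
```

Under these assumptions <ED A0 80> decodes to 0xD800 and fails the
scalar-value test, while any 6-byte sequence starting with 0xFC is
either below 0x4000000 (non-shortest) or above 0x10FFFF (not a scalar
value), so no special case for 0xFC is needed.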
The two approaches differ only in presentation; the outcome as to what
is permitted is the same. What matters is whether the rules are
comprehensible. A comprehensible definition is more likely to be
implemented correctly. Where the presentation makes a difference is in
how malformed sequences are naturally handled.
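
As a rough illustration of that last point, the Python sketch below
shows one way a decoder that thinks in terms of the generic bit pattern
might group bytes before rejecting them. Everything in it is an
assumption made for the example: the names lead_length,
structural_units and replace_by_structure are hypothetical, and the
grouping rule (a lead byte plus as many continuation bytes as it
announces) is only one plausible reading of 'natural' handling.

```python
def lead_length(lead):
    """Length announced by the generic lead-byte pattern, or 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # continuation byte, 0xFE or 0xFF


def structural_units(data):
    """Split data into a lead byte plus up to (length - 1) continuation bytes.

    Bytes that cannot start a sequence become units of one byte.
    """
    units, i = [], 0
    while i < len(data):
        want = max(lead_length(data[i]), 1)
        j = i + 1
        while j < len(data) and j - i < want and data[j] & 0xC0 == 0x80:
            j += 1
        units.append(data[i:j])
        i = j
    return units


def replace_by_structure(data):
    """Decode data, emitting one U+FFFD per rejected structural unit."""
    out = []
    for unit in structural_units(data):
        try:
            out.append(unit.decode("utf-8"))  # strict: rejects ill-formed units
        except UnicodeDecodeError:
            out.append("\ufffd")
    return "".join(out)
```

With this grouping, replace_by_structure(b"\xfc\x80\x80\x80\x80\x80")
yields a single U+FFFD, whereas a decoder driven by Table 3-7, such as
Python's own bytes.decode("utf-8", "replace"), treats 0xFC as unable to
start anything and emits more than one replacement character for the
same input.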
Richard.