Running out of code points, redux (was: Re: Feedback on the proposal...)

Ken Whistler via Unicode unicode at unicode.org
Thu Jun 1 21:19:51 CDT 2017


On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:
>> By definition D39b, either sequence of bytes, if encountered by a
>> conformant UTF-8 conversion process, would be interpreted as a
>> sequence of 6 maximal subparts of an ill-formed subsequence.
> ("D39b" is a typo for "D93b".)

Sorry about that. :)

>
> Conformant with what?  There is no mandatory *requirement* for a UTF-8
> conversion process conformant with Unicode to have any concept of
> 'maximal subpart'.

Conformant with the definition of UTF-8. I agree that nothing forces a 
conversion *process* to care anything about maximal subparts, but if 
*any* process using a conformant definition of UTF-8 then goes on to 
have any concept of "maximal subpart of an ill-formed subsequence" that 
departs from definition D93b in the Unicode Standard, then it is just 
making s**t up.

>
>> I don't see a good reason to build in special logic to treat FC 80 80
>> 80 80 80 as somehow privileged as a unit for conversion fallback,
>> simply because *if* UTF-8 were defined as the Unix gods intended
>> (which it ain't no longer) then that sequence *could* be interpreted
>> as an out-of-bounds scalar value (which it ain't) on spec that the
>> codespace *might* be extended past 10FFFF at some indefinite time in
>> the future (which it won't).
> Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
> invalid sequence.

That would be equally true of FF FF FF FF FF FF. Which was my point, 
actually.
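
To make that concrete, here is a minimal Python sketch of splitting a byte string into well-formed sequences and maximal subparts of ill-formed subsequences, following D93b and the well-formed byte ranges of Table 3-7 in the Unicode Standard. The names `_trail_ranges` and `split_utf8` are made up for illustration; this is one possible reading of the definition, not code from any particular converter.

```python
def _trail_ranges(lead):
    """Allowed (low, high) ranges for the bytes after `lead`, per Table 3-7.

    Returns None when `lead` cannot start a multi-byte well-formed sequence.
    """
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
    return None


def split_utf8(data):
    """Split `data` into (chunk, is_well_formed) pairs.

    Each ill-formed chunk is a maximal subpart: either the longest prefix of
    some well-formed sequence, or a single byte.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            out.append((data[i:i + 1], True))
            i += 1
            continue
        trail = _trail_ranges(lead)
        if trail is None:
            # Not a possible lead byte (80..C1, F5..FF): subpart of length one.
            out.append((data[i:i + 1], False))
            i += 1
            continue
        j = i + 1
        for lo, hi in trail:
            if j < n and lo <= data[j] <= hi:
                j += 1
            else:
                break
        out.append((data[i:j], j - i == 1 + len(trail)))
        i = j
    return out


for hexstr in ("FC8080808080", "FFFFFFFFFFFF"):
    chunks = split_utf8(bytes.fromhex(hexstr))
    print(hexstr, len(chunks), all(not ok for _, ok in chunks))
```

Under this reading neither sequence gets any special treatment: FC falls into the same "cannot start anything" branch as FF, and each following 80 is likewise a subpart of length one.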

>    FC is not ASCII,

True, of course. But irrelevant, because we are talking about UTF-8 
here. And the fact that some non-UTF-8 character encoding happens to 
include 0xFC as a valid (or invalid) value might not require any 
special-case processing. A simple 8-bit to 8-bit conversion table could 
be completely regular in its processing of 0xFC for a conversion.

>   and has more than one leading bit
> set.  It has the six leading bits set,

True, of course.

>   and therefore should start a
> sequence of 6 characters.

That is completely false, and has nothing to do with the current 
definition of UTF-8.

The current, normative definition of UTF-8, in the Unicode Standard, and 
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and 
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of 
anything identifiable as UTF-8.
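
A small Python sketch of the difference, assuming the lead-byte ranges of RFC 3629 and Table 3-7 of the Unicode Standard. Both function names are hypothetical and chosen only for this illustration: `leading_ones` does the naive bit counting described in the quoted text, while `utf8_sequence_length` reflects what the current definition allows.

```python
def leading_ones(byte):
    """Count the consecutive 1 bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def utf8_sequence_length(lead):
    """Length of the sequence a byte may start under RFC 3629, else None.

    Bytes 80..BF are trailing bytes only; C0, C1 and F5..FF never occur in
    well-formed UTF-8 at all.
    """
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None


# 0xFC has six leading 1 bits, yet starts nothing in current UTF-8.
print(leading_ones(0xFC), utf8_sequence_length(0xFC))
```

The bit count only tells you what the obsoleted RFC 2279 layout would have implied; the length function is the one that matters for anything calling itself UTF-8 today. Note that it checks the lead byte only, so the restricted second-byte ranges after E0, ED, F0 and F4 still have to be validated separately.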

--Ken

>
> Richard.
>
