Running out of code points, redux (was: Re: Feedback on the proposal...)
Ken Whistler via Unicode
unicode at unicode.org
Thu Jun 1 21:19:51 CDT 2017
On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:
>> By definition D39b, either sequence of bytes, if encountered by a
>> conformant UTF-8 conversion process, would be interpreted as a
>> sequence of 6 maximal subparts of an ill-formed subsequence.
> ("D39b" is a typo for "D93b".)
Sorry about that. :)
> Conformant with what? There is no mandatory *requirement* for a UTF-8
> conversion process conformant with Unicode to have any concept of
> 'maximal subpart'.
Conformant with the definition of UTF-8. I agree that nothing forces a
conversion *process* to care anything about maximal subparts, but if
*any* process using a conformant definition of UTF-8 then goes on to
have any concept of "maximal subpart of an ill-formed subsequence" that
departs from definition D93b in the Unicode Standard, then it is just
making s**t up.
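To make that concrete, here is a minimal sketch, assuming CPython 3 and its built-in UTF-8 decoder with the "replace" error handler. That decoder is only one implementation and nothing requires a process to behave this way, but as far as I can tell its replacement behavior lines up with D93b for this input. The helper name count_replacements is made up for the illustration.

```python
# Illustration only: relies on CPython's built-in UTF-8 decoder. The
# "replace" handler substitutes U+FFFD for each unit the decoder rejects.
def count_replacements(data: bytes) -> int:
    """Count the U+FFFD characters produced when decoding `data` as UTF-8."""
    decoded = data.decode("utf-8", errors="replace")
    return decoded.count("\ufffd")


fc_sequence = bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x80])

# FC cannot start anything, and each 80 is a stray trail byte, so every
# byte is rejected on its own.
print(count_replacements(fc_sequence))  # prints 6
```

The six bytes come back as six replacement characters, not as one unit.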
>> I don't see a good reason to build in special logic to treat FC 80 80
>> 80 80 80 as somehow privileged as a unit for conversion fallback,
>> simply because *if* UTF-8 were defined as the Unix gods intended
>> (which it ain't no longer) then that sequence *could* be interpreted
>> as an out-of-bounds scalar value (which it ain't) on spec that the
>> codespace *might* be extended past 10FFFF at some indefinite time in
>> the future (which it won't).
> Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
> invalid sequence.
That would be equally true of FF FF FF FF FF FF. Which was my point.
> FC is not ASCII,
True, of course. But irrelevant. Because we are talking about UTF-8
here. And the fact that some non-UTF-8 character encoding happens to
include 0xFC as a valid (or invalid) value need not require any
special-case processing. A simple 8-bit to 8-bit conversion table could
be completely regular in its processing of 0xFC for a conversion.
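A rough sketch of what such a table-driven conversion might look like, assuming Python's bundled "latin-1" and "cp437" codecs as the two 8-bit encodings. Those two are picked purely for illustration and are not from the discussion above; build_table is a hypothetical helper.

```python
def build_table(source: str, target: str) -> bytes:
    """Build a 256-entry byte-to-byte table between two 8-bit encodings.

    Assumes `source` can decode every byte value (true of latin-1).
    Characters missing from `target` fall back to b"?".
    """
    table = bytearray(256)
    for b in range(256):
        ch = bytes([b]).decode(source)
        table[b] = ch.encode(target, errors="replace")[0]
    return bytes(table)


LATIN1_TO_CP437 = build_table("latin-1", "cp437")

# 0xFC goes through the same single lookup as every other byte value.
converted = b"f\xfcr".translate(LATIN1_TO_CP437)
print(converted.hex())
```

Nothing in the loop or in the lookup knows or cares that 0xFC has its top six bits set.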
> and has more than one leading bit
> set. It has the six leading bits set,
True, of course.
> and therefore should start a
> sequence of 6 characters.
That is completely false, and has nothing to do with the current
definition of UTF-8.
The current, normative definition of UTF-8, in the Unicode Standard, and
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of
anything identifiable as UTF-8.
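
For anyone who wants to check this mechanically, here is a minimal sketch of the byte ranges as I read them from Table 3-7 (Well-Formed UTF-8 Byte Sequences) in the Unicode Standard, together with a splitter that follows D93b. The function names trail_ranges and split_units are made up for this illustration, and the code is a sketch, not a vetted decoder.

```python
def trail_ranges(lead: int):
    """Return the allowed (low, high) range for each byte that may follow
    `lead`, or None if `lead` cannot start a well-formed sequence."""
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEF:
        return [(0x80, 0xBF)] * 2
    if lead == 0xF0:
        return [(0x90, 0xBF)] + [(0x80, 0xBF)] * 2
    if lead == 0xF4:
        return [(0x80, 0x8F)] + [(0x80, 0xBF)] * 2
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF)] * 3
    return None  # 80..C1 and F5..FF, which includes FC


def split_units(data: bytes):
    """Split `data` into (bytes, is_well_formed) units. Each ill-formed
    unit is either the longest initial part of a well-formed sequence
    or a single byte."""
    units = []
    i = 0
    while i < len(data):
        ranges = trail_ranges(data[i])
        well_formed = ranges is not None
        j = i + 1
        for low, high in ranges or []:
            if j < len(data) and low <= data[j] <= high:
                j += 1
            else:
                well_formed = False
                break
        units.append((data[i:j], well_formed))
        i = j
    return units


units = split_units(bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x80]))
print(len(units))  # prints 6
```

Under these assumptions trail_ranges(0xFC) returns None, so FC ends up as a one-byte ill-formed unit, and each following 80 is a one-byte unit as well.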