Running out of code points, redux (was: Re: Feedback on the proposal...)

Thu Jun 1 19:10:54 CDT 2017

On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> You were implicitly invited to argue that there was no need to handle
> 5 and 6 byte invalid sequences.
>

Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D93b, either sequence of bytes, if encountered by a 
conformant UTF-8 conversion process, would be interpreted as a sequence 
of 6 maximal subparts of an ill-formed subsequence. Whatever your 
particular strategy for conversion fallbacks for uninterpretable 
sequences, it ought to treat either one of those trash sequences the 
same, in my book.
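
A quick way to see this in practice is to feed both sequences to an existing decoder and count the substitutions. The sketch below assumes CPython 3, whose built-in UTF-8 decoder, as far as I know, replaces each maximal subpart with one U+FFFD; count_replacements is a hypothetical helper name, not anything from the standard.

```python
# Sketch: count the U+FFFD substitutions made for each trash sequence.
# Assumption: the decoder emits one U+FFFD per maximal subpart
# (CPython 3's built-in UTF-8 decoder behaves this way for these inputs).

def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with U+FFFD substitution and count the substitutions."""
    return data.decode("utf-8", errors="replace").count("\ufffd")

six_byte_form = bytes.fromhex("FC 80 80 80 80 80")  # obsolete 6-byte pattern
all_ff = bytes.fromhex("FF FF FF FF FF FF")         # never valid in any form

print(count_replacements(six_byte_form))  # 6
print(count_replacements(all_ff))         # 6
```

Both inputs come out as six replacement characters: no byte in either sequence can start or continue a well-formed sequence, so each one is a maximal subpart of length one.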

I don't see a good reason to build in special logic to treat FC 80 80 80 
80 80 as somehow privileged as a unit for conversion fallback, simply 
because *if* UTF-8 were defined as the Unix gods intended (which it 
ain't no longer) then that sequence *could* be interpreted as an 
out-of-bounds scalar value (which it ain't) on spec that the codespace 
*might* be extended past 10FFFF at some indefinite time in the future 
(which it won't).

--Ken