Running out of code points, redux (was: Re: Feedback on the proposal...)

Richard Wordingham via Unicode unicode at unicode.org
Thu Jun 1 20:21:55 CDT 2017


On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> > You were implicitly invited to argue that there was no need to
> > handle 5 and 6 byte invalid sequences.
> >  
> 
> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D39b, either sequence of bytes, if encountered by a
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)

Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.
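
For illustration only, here is what one decoder that happens to follow
the maximal-subpart practice does with the two sequences.  This is a
minimal sketch using CPython 3's built-in UTF-8 codec with
errors='replace'; it shows one implementation's choice, not something
the standard requires.  The names SEQ_A, SEQ_B and count_replacements
are made up for the example.

```python
# Decode each ill-formed sequence with replacement and count the U+FFFD
# characters produced. Each byte here is rejected on its own: FC and FF
# cannot start a sequence, and a lone 80 is a stray continuation byte.
SEQ_A = bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x80])
SEQ_B = bytes([0xFF] * 6)


def count_replacements(data: bytes) -> int:
    """Return how many U+FFFD the decoder emits for the given bytes."""
    text = data.decode("utf-8", errors="replace")
    return text.count("\ufffd")


if __name__ == "__main__":
    for name, seq in (("FC 80 80 80 80 80", SEQ_A), ("FF FF FF FF FF FF", SEQ_B)):
        print(name, "->", count_replacements(seq), "replacement characters")
```

A decoder that instead emitted a single U+FFFD for the whole of
FC 80 80 80 80 80 would be making a different fallback choice.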

> I don't see a good reason to build in special logic to treat FC 80 80
> 80 80 80 as somehow privileged as a unit for conversion fallback,
> simply because *if* UTF-8 were defined as the Unix gods intended
> (which it ain't no longer) then that sequence *could* be interpreted
> as an out-of-bounds scalar value (which it ain't) on spec that the
> codespace *might* be extended past 10FFFF at some indefinite time in
> the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.  FC is not ASCII, and has more than one leading bit
set.  It has the six leading bits set, and therefore should start a
sequence of 6 bytes.
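
To make the 'special logic' point concrete, here is a rough sketch in
Python.  The first function derives a sequence length purely from the
leading bits of the first byte, as in the original 1-to-6 byte scheme;
the second is the extra range check that the current definition
requires (lead bytes 00..7F and C2..F4 only).  The function names are
invented for the example and come from no library.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of an 8-bit value before the first 0."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def length_by_leading_bits(lead: int):
    """Sequence length implied by the leading bits alone (1-to-6 byte scheme).

    Returns None for bytes that cannot start a sequence in that scheme.
    """
    n = leading_ones(lead)
    if n == 0:
        return 1          # ASCII: a single byte
    if 2 <= n <= 6:
        return n          # n leading 1 bits: an n-byte sequence
    return None           # 1 bit: continuation byte; 7 or 8 bits: FE, FF


def is_lead_byte_today(lead: int) -> bool:
    """Extra check under the current definition: only 00..7F and C2..F4."""
    return lead <= 0x7F or 0xC2 <= lead <= 0xF4


if __name__ == "__main__":
    for b in (0x41, 0xE2, 0xF4, 0xFC, 0xFF):
        print(f"{b:02X}", length_by_leading_bits(b), is_lead_byte_today(b))
```

By the bit pattern alone, FC announces six bytes; it is only the
additional range check that rejects it.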

Richard.

