Running out of code points, redux (was: Re: Feedback on the proposal...)
Ken Whistler via Unicode
unicode at unicode.org
Thu Jun 1 19:10:54 CDT 2017
On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> You were implicitly invited to argue that there was no need to handle
> 5 and 6 byte invalid sequences.
>
Well, working from the *current* specification:
FC 80 80 80 80 80
and
FF FF FF FF FF FF
are equal trash, uninterpretable as *anything* in UTF-8.
By definition D93b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a sequence
of 6 maximal subparts of an ill-formed subsequence. Whatever your
particular strategy for conversion fallbacks for uninterpretable
sequences, it ought to treat either one of those trash sequences the
same, in my book.
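
To make the point concrete, here is a minimal Python sketch of that segmentation. It is an illustration only, not text from the standard: the function names (trail_ranges, split_units) are made up for this example, and the byte ranges are my transcription of the well-formed UTF-8 byte sequence table in the current specification.

```python
def trail_ranges(lead):
    """Allowed (lo, hi) ranges for the bytes following `lead`.

    Returns None if `lead` cannot start a well-formed UTF-8 sequence
    (80..C1 and F5..FF, which covers both FC and FF).
    """
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEF:
        return [(0x80, 0xBF)] * 2
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF)] * 3
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
    return None


def split_units(data):
    """Split bytes into (chunk, is_well_formed) pairs.

    Each ill-formed chunk is either a single byte that cannot begin a
    well-formed sequence, or the longest prefix of a well-formed
    sequence that is cut short, i.e. a maximal subpart.
    """
    units = []
    i = 0
    while i < len(data):
        ranges = trail_ranges(data[i])
        j = i + 1
        ok = ranges is not None
        if ok:
            for lo, hi in ranges:
                if j < len(data) and lo <= data[j] <= hi:
                    j += 1
                else:
                    ok = False  # sequence is truncated or has a bad trail byte
                    break
        units.append((data[i:j], ok))
        i = j
    return units


trash_a = bytes.fromhex("FC8080808080")
trash_b = bytes.fromhex("FFFFFFFFFFFF")
print(len(split_units(trash_a)), len(split_units(trash_b)))
```

Under these assumptions both inputs split into six one-byte ill-formed units, so a fallback that substitutes one U+FFFD per unit produces the same result for either sequence. By contrast, a truncated but otherwise plausible prefix such as E2 82 followed by 41 comes out as one two-byte ill-formed unit and then the well-formed byte 41.
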
I don't see a good reason to build in special logic to treat FC 80 80 80
80 80 as somehow privileged as a unit for conversion fallback, simply
because *if* UTF-8 were defined as the Unix gods intended (which it
ain't no longer) then that sequence *could* be interpreted as an
out-of-bounds scalar value (which it ain't) on spec that the codespace
*might* be extended past 10FFFF at some indefinite time in the future
(which it won't).
--Ken