Running out of code points, redux (was: Re: Feedback on the proposal...)

Richard Wordingham via Unicode unicode at unicode.org
Thu Jun 1 20:45:29 CDT 2017


On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D93b, either sequence of bytes, if encountered by a
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

There is a very good argument that 0xFC and 0xFF are not code units
(D77) - they are not used in the representation of any Unicode scalar
value.  By that argument, the two sequences together give you five
maximal subparts (the 0x80 bytes) and seven garbage bytes (the one
0xFC and the six 0xFF).
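
For anyone who wants to check the counting, here is a rough Python
sketch. It assumes that "code unit" in the D77 sense means a byte value
that occurs in the UTF-8 encoding of at least one Unicode scalar value.
The names utf8_bytes_in_use and classify are made up for illustration,
and the final decode() calls only show what CPython's codec happens to
do with errors="replace"; they are not a statement of what the standard
requires.

```python
def utf8_bytes_in_use():
    """Collect every byte value that appears in well-formed UTF-8."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not Unicode scalar values
        used.update(chr(cp).encode("utf-8"))
    return used


used = utf8_bytes_in_use()
never_used = sorted(set(range(256)) - used)


def classify(data):
    """Return (code_units, garbage) byte counts under the D77 reading.

    Assumption: a byte is a UTF-8 code unit only if it is in `used`.
    """
    garbage = sum(1 for b in data if b not in used)
    return len(data) - garbage, garbage


seq1 = bytes.fromhex("FC8080808080")
seq2 = bytes.fromhex("FF" * 6)

print([f"{b:02X}" for b in never_used])  # byte values never used in UTF-8
print(classify(seq1 + seq2))             # (code units, garbage bytes)

# CPython's decoder, for comparison: one U+FFFD per offending byte.
print(len(seq1.decode("utf-8", errors="replace")))
print(len(seq2.decode("utf-8", errors="replace")))
```

Under that assumption, the five 0x80 bytes are the only code units in
the two sequences, and each one is a maximal subpart on its own; 0xFC
and the six 0xFF bytes fall outside the set entirely.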

Richard.

