Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham via Unicode unicode at unicode.org
Tue May 16 12:58:22 CDT 2017


On Tue, 16 May 2017 17:30:01 +0000
Shawn Steele via Unicode <unicode at unicode.org> wrote:

> > Would you advocate replacing  
> 
> >   e0 80 80  
> 
> > with  
> 
> >   U+FFFD U+FFFD U+FFFD     (1)  
> 
> > rather than  
> 
> >   U+FFFD                   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d
> > say, and while we certainly don’t want to decode it as a NUL (that
> > was the source of previous security bugs, as I recall), I also
> > don’t see the logic in insisting that it must be decoded to *three*
> > code points when it clearly only represented one in the input.  
> 
> It is not at all clear what the intent of the encoder was - or even
> if it's not just a problem with the data stream.  E0 80 80 is not
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is
still in use and seems to be the best way of storing NUL as character
content in a *C string*.  (Strictly speaking, one can't do it at all:
NUL is the string terminator.)  It could be lurking in old text, or
come from an old program that somehow never gets used for characters
in the range U+0080 to U+07FF.  Converting every UCS-2 code unit to
three bytes was an easily coded way of converting UCS-2 to UTF-8.
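
A small sketch of that shortcut, in Python purely for brevity (the
function name is made up for illustration): encoding every UCS-2 code
unit as exactly three bytes needs no branching on the code point's
value, but everything below U+0800 comes out overlong, including
E0 80 80 for U+0000.

    def naive_ucs2_to_utf8(code_units):
        """Encode every UCS-2 code unit as exactly three bytes."""
        out = bytearray()
        for cu in code_units:
            out.append(0xE0 | (cu >> 12))           # 1110xxxx
            out.append(0x80 | ((cu >> 6) & 0x3F))   # 10xxxxxx
            out.append(0x80 | (cu & 0x3F))          # 10xxxxxx
        return bytes(out)

    print(naive_ucs2_to_utf8([0x0000]).hex())  # 'e08080' (overlong NUL)
    print(naive_ucs2_to_utf8([0x4E2D]).hex())  # 'e4b8ad' (same bytes a
                                               #  correct encoder gives)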

Remember that the conformance test for the Unicode Collation Algorithm
has contained lone surrogates in the past, and that UTS #18, Unicode
Regular Expressions, used to require the ability to search for lone
surrogates.

Richard.


