Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 13:09:32 CDT 2017

Regardless, it's not legal and hasn't been legal for quite some time.  Replacing a hacked embedded "null" with U+FFFD is going to break anything that depends on that fake null, so whether it becomes one U+FFFD or three isn't really going to matter.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode at unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +0000
Shawn Steele via Unicode <unicode at unicode.org> wrote:

> > Would you advocate replacing
> 
> >   e0 80 80
> 
> > with
> 
> >   U+FFFD U+FFFD U+FFFD     (1)  
> 
> > rather than
> 
> >   U+FFFD                   (2)  
> 
> > It’s pretty clear what the intent of the encoder was there, I’d say, 
> > and while we certainly don’t want to decode it as a NUL (that was 
> > the source of previous security bugs, as I recall), I also don’t see 
> > the logic in insisting that it must be decoded to *three* code 
> > points when it clearly only represented one in the input.
> 
> It is not at all clear what the intent of the encoder was - or even if 
> it's not just a problem with the data stream.  E0 80 80 is not 
> permitted, it's garbage.  An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*.  (Strictly speaking, one can't do it.)  It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8.
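
To make the overlong forms concrete, here is a minimal Python sketch. The helper decode_permissive is hypothetical and written only for illustration: it applies the UTF-8 bit layout without the shortest-form check, so it is not a conformant decoder. The last lines show what CPython's built-in strict decoder does with the same bytes under errors="replace"; other decoders may emit a different number of U+FFFD, which is the point under discussion in this thread.

```python
def decode_permissive(seq: bytes) -> int:
    """Decode ONE UTF-8-style sequence without rejecting overlong forms.

    Hypothetical helper for illustration only; a conformant decoder
    must reject the overlong sequences this function accepts.
    """
    lead = seq[0]
    if lead < 0x80:
        return lead                      # single-byte (ASCII) form
    if lead & 0xE0 == 0xC0:
        n, cp = 1, lead & 0x1F           # 2-byte form: 110xxxxx
    elif lead & 0xF0 == 0xE0:
        n, cp = 2, lead & 0x0F           # 3-byte form: 1110xxxx
    elif lead & 0xF8 == 0xF0:
        n, cp = 3, lead & 0x07           # 4-byte form: 11110xxx
    else:
        raise ValueError("not a lead byte")
    if len(seq) != n + 1:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)      # append 6 payload bits
    return cp


# Both overlong forms carry an all-zero payload, i.e. NUL.
two_byte_nul = decode_permissive(b"\xc0\x80")
three_byte_nul = decode_permissive(b"\xe0\x80\x80")

# A well-formed 3-byte sequence for comparison (U+20AC EURO SIGN).
euro = decode_permissive(b"\xe2\x82\xac")

# CPython's strict decoder treats each of the three bytes as its own error.
strict = b"\xe0\x80\x80".decode("utf-8", errors="replace")
print(two_byte_nul, three_byte_nul, hex(euro), len(strict))
```

Under these assumptions the permissive helper maps both C0 80 and E0 80 80 to 0, while CPython's decoder yields three U+FFFD for E0 80 80, corresponding to option (1) in the quoted question.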

Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates.

Richard.