Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode
Tue May 16 12:30:01 CDT 2017

> Would you advocate replacing

>   e0 80 80

> with

>   U+FFFD U+FFFD U+FFFD     (1)

> rather than

>   U+FFFD                   (2)

> It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t 
> want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t
> see the logic in insisting that it must be decoded to *three* code points when it clearly only 
> represented one in the input.

It is not at all clear what the intent of the encoder was, or even whether this is just a problem with the data stream.  E0 80 80 is not permitted; it's garbage.  An encoder can't "intend" it.  Either:

A) The "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, so the number of FFFDs doesn't matter; or

B) The "encoder" is completely broken, in which case all bets are off and, again, specifying the number of FFFDs is irrelevant; or

C) The data was corrupted by some other means: perhaps bad concatenations, lost blocks during read/transmission, etc.  If we lost two 512-byte blocks, then maybe we should have a thousand FFFDs (but how would we know?).
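
For concreteness, the two outcomes under discussion for e0 80 80 can be sketched in Python. This is only an illustration under stated assumptions, not anyone's actual decoder: the function names are hypothetical, outcome (1) is modeled as one U+FFFD per maximal subpart of an ill-formed sequence (using the well-formed byte ranges from the Unicode Standard), and outcome (2) is modeled as one U+FFFD for a lead byte plus the continuation bytes its bit pattern promises, which is just one possible reading of that proposal.

```python
# Sketch only: both function names are hypothetical, not from any library.

def _second_byte_range(lead):
    """Allowed range for the byte right after `lead` in well-formed UTF-8."""
    if lead == 0xE0:
        return 0xA0, 0xBF   # excludes non-shortest forms such as E0 80 80
    if lead == 0xED:
        return 0x80, 0x9F   # excludes surrogates
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F   # excludes values above U+10FFFF
    return 0x80, 0xBF

def _wellformed_length(lead):
    """Sequence length for a valid lead byte, or 0 if it cannot start one."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0

def decode_maximal_subpart(data):
    """Outcome (1): one U+FFFD per maximal subpart of an ill-formed sequence."""
    out, i = [], 0
    while i < len(data):
        need = _wellformed_length(data[i])
        j = i + 1
        lo, hi = _second_byte_range(data[i])
        # Extend only while each byte is valid at its position.
        while j < i + need and j < len(data) and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF
        if need and j == i + need:
            out.append(data[i:j].decode("utf-8"))
        else:
            out.append("\ufffd")
        i = j
    return "".join(out)

def decode_structural(data):
    """Outcome (2), as assumed here: one U+FFFD per lead byte plus the
    continuation bytes (80..BF) that its bit pattern calls for."""
    out, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            need = 1
        elif lead >> 5 == 0b110:
            need = 2
        elif lead >> 4 == 0b1110:
            need = 3
        elif lead >> 3 == 0b11110:
            need = 4
        else:
            need = 1        # stray continuation byte or F8..FF
        j = i + 1
        while j < i + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))   # strict decode of the chunk
        except UnicodeDecodeError:
            out.append("\ufffd")
        i = j
    return "".join(out)

bad = bytes([0xE0, 0x80, 0x80])
print(len(decode_maximal_subpart(bad)))  # 3
print(len(decode_structural(bad)))       # 1
```

Neither function recovers anything about what the bytes were "meant" to be; they differ only in how many replacement characters they emit for the same garbage.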


