Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode unicode at
Wed May 31 12:43:08 CDT 2017

> > In either case, the bad characters are garbage, so neither approach is 
> > "better" - except that one or the other may be more conducive to the 
> > requirements of the particular API/application.

> There's a potential issue with input methods that indirectly edit the backing store.  For example,
> GTK input methods (e.g. function gtk_im_context_delete_surrounding()) can delete an amount 
> of text specified in characters, not storage units.  (Deletion by storage units is not available in this
> interface.)  This might cause utter confusion or worse if the backing store starts out corrupt. 
> A corrupt backing store is normally manually correctable if most of the text is ASCII.

I think that's sort of what I said: some approaches might work better for some systems and another approach might work better for another system.  This also presupposes a corrupt store.

It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 bytes in the middle of ASCII text.  Is that garbage a single "character", perhaps because it's a consecutive string of bad bytes?  Or should it be 6 characters, since each byte is nonsense on its own?  Or maybe 2 characters, because the maximum number of trail bytes we can have is 3?
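
The three candidate answers can be made concrete with a minimal Python sketch. It assumes CPython's built-in decoder with errors="replace", which emits one U+FFFD per stray 0x80 byte. The "per run" and "per three" policies are hypothetical post-processing steps added only for comparison; they are not something any standard prescribes.

```python
import re

# Six stray trail bytes (0x80) in the middle of ASCII text.
data = b"abc" + b"\x80" * 6 + b"def"

# Policy A: one U+FFFD per bad byte (CPython's behavior for this input).
per_byte = data.decode("utf-8", errors="replace")

# Policy B (hypothetical): treat the whole run of bad bytes as one "character".
per_run = re.sub("\ufffd+", "\ufffd", per_byte)

# Policy C (hypothetical): group stray trail bytes in threes, the maximum
# number of trail bytes a single sequence can have.
per_three = re.sub("\ufffd{1,3}", "\ufffd", per_byte)

for name, text in [("per byte", per_byte), ("per run", per_run), ("per three", per_three)]:
    print(name, text.count("\ufffd"))
```

This prints 6, 1 and 2 replacement characters respectively.  Collapsing on the decoded string would also merge any genuine U+FFFD already present in the input, so it is an illustration only.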

What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?
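
For the lead-byte case, a short Python check shows what one decoder does.  This again assumes CPython's built-in decoder; other decoders may count differently.

```python
# Two consecutive 2-byte lead bytes (0xC2) with no trail bytes,
# placed between two ASCII letters.
data = b"a\xc2\xc2b"

decoded = data.decode("utf-8", errors="replace")
print([hex(ord(ch)) for ch in decoded])
# ['0x61', '0xfffd', '0xfffd', '0x62']
```

CPython reports each orphaned lead byte separately, so a deletion counted in characters would see two characters here.  A decoder that merged the pair would see one.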

I can see how different implementations might come up with "rules" that would help them navigate (or clean up) those minefields; however, it is not at all clear to me that there is a "best practice" for those situations.

There also appears to be a special weight given to non-minimally-encoded sequences.  It would seem to me that none of these illegal sequences should appear in practice, so we have either:

* A bad encoder spewing out garbage (overlong sequences)
* Flipped bit(s) due to storage/transmission/whatever errors
* Lost byte(s) due to storage/transmission/coding/whatever errors
* Extra byte(s) due to whatever errors
* Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 coding errors).
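
The non-encoder failure modes in this list are easy to simulate.  The sketch below damages a valid two-character string in four ways and checks that a strict decoder rejects each result.  The sample string and the particular byte positions are arbitrary choices made for illustration.

```python
# U+00E9 (e with acute) followed by U+20AC (euro sign).
good = "\u00e9\u20ac".encode("utf-8")   # b'\xc3\xa9\xe2\x82\xac'

# Flipped bit: clear the top bit of the first trail byte (0xA9 -> 0x29).
flipped = bytes([good[0], good[1] ^ 0x80]) + good[2:]

# Lost byte: drop the middle byte of the 3-byte sequence.
lost = good[:3] + good[4:]

# Extra byte: a stray trail byte inserted between the two characters.
extra = good[:2] + b"\x80" + good[2:]

# Bad string manipulation: slice in the middle of the 3-byte sequence.
sliced = good[:4]

for name, damaged in [("flipped", flipped), ("lost", lost),
                      ("extra", extra), ("sliced", sliced)]:
    try:
        damaged.decode("utf-8")
        outcome = "decoded cleanly"
    except UnicodeDecodeError as err:
        outcome = "rejected: " + err.reason
    print(name, damaged.hex(), outcome)
```

None of these damaged byte strings contains an overlong sequence, yet each is ill-formed.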

Only in the first case, of a bad encoder, are the overlong sequences actually "real".  And that shouldn't happen (it's a bad encoder after all).  The other scenarios seem just as likely as (or, IMO, much more likely than) a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8.
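
To show what "appears to fit the UTF-8 pattern but isn't actually UTF-8" means in bytes, here is a small Python sketch.  The function naive_decode_2byte is a hypothetical lenient decoder written for this example; the conforming behavior shown is that of CPython's built-in decoder.

```python
# 0xC0 0xAF is a well-known overlong (non-minimal) encoding of "/" (U+002F).
overlong = b"\xc0\xaf"

def naive_decode_2byte(b):
    """Hypothetical lenient decoder: trusts the lead/trail bit pattern
    and skips the check that the encoding is the shortest possible."""
    return chr(((b[0] & 0x1F) << 6) | (b[1] & 0x3F))

# The lenient decoder happily produces a slash.
print(repr(naive_decode_2byte(overlong)))   # '/'

# A strict decoder rejects the same bytes.
try:
    overlong.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)                            # False

replaced = overlong.decode("utf-8", errors="replace")
print(len(replaced))                        # 2
```

With errors="replace", CPython yields two U+FFFD for these two bytes, since neither 0xC0 nor 0xAF can start a well-formed sequence.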

The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications.

I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen.

