Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode unicode at unicode.org
Mon May 15 15:05:55 CDT 2017


>> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting multiple errors there makes no sense.
> 
> Changing a specification as fundamental as this is something that should not be undertaken lightly.

IMO, the only thing that can be agreed upon is that "something's bad with this UTF-8 data".  I think that whether it's treated as a single group of corrupt bytes or each individual byte is treated as its own problem should be up to the implementation.
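
To make the two options concrete, here's a minimal toy sketch (not from any real implementation; the decode_utf8 name and the replace_per_byte flag are made up for this mail).  The byte-range checks follow the well-formed UTF-8 table in the Unicode Standard; the only difference between the two policies is how many U+FFFDs come out for an ill-formed run:

    # Toy decoder sketch: replace_per_byte=False emits one U+FFFD per
    # ill-formed run; replace_per_byte=True emits one U+FFFD per rejected byte.
    def decode_utf8(data: bytes, replace_per_byte: bool = False) -> str:
        out = []
        i = 0
        while i < len(data):
            b = data[i]
            if b <= 0x7F:                        # 1-byte (ASCII)
                out.append(chr(b)); i += 1; continue
            # The lead byte determines how many trail bytes must follow and
            # the allowed range of the second byte (this is what rules out
            # over-long encodings, surrogates, and values above U+10FFFF).
            if   0xC2 <= b <= 0xDF: need, lo, hi = 1, 0x80, 0xBF
            elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF
            elif 0xE1 <= b <= 0xEC: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F
            elif 0xEE <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
            elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF
            elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
            elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F
            else:                                # C0, C1, F5..FF, or a stray trail byte
                out.append('\uFFFD'); i += 1; continue
            # Consume the trail bytes that actually fit the pattern.
            j = i + 1
            if j < len(data) and lo <= data[j] <= hi:
                j += 1
                while j < i + 1 + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
                    j += 1
            if j == i + 1 + need:                # well-formed sequence
                cp = b & (0x3F >> need)
                for t in data[i + 1:j]:
                    cp = (cp << 6) | (t & 0x3F)
                out.append(chr(cp))
            elif replace_per_byte:               # policy A: one U+FFFD per byte
                out.append('\uFFFD' * (j - i))
            else:                                # policy B: one U+FFFD for the run
                out.append('\uFFFD')
            i = j
        return ''.join(out)

    # b"\xf0\x9f\x98A" is a truncated 4-byte sequence followed by 'A':
    # decode_utf8(...)       -> '\ufffdA'              (one U+FFFD for the run)
    # decode_utf8(..., True) -> '\ufffd\ufffd\ufffdA'  (one U+FFFD per byte)

Both loops do the same validation work; the disagreement is only about the last few lines.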

#1 - This data should "never happen".  In a system behaving normally, this condition should never be encountered.  
  * At this point the data is "bad" and all bets are off.
  * Some applications may have a clue how the bad data could have happened and want to do something in particular.
  * It seems odd to me to spend much effort standardizing a scenario that should be impossible.
#2 - Depending on the implementation, either behavior, or some combination, may be more efficient.  I'd rather allow apps to optimize for the common case, not the case-that-shouldn't-ever-happen.
#3 - We have no clue whether this "maximal" sequence was a single error, 2 errors, or even more (see the byte-level example after this list).  The lead byte says how many trail bytes should follow, and those trail bytes must fall in a certain range.  Byte values outside those constraints are illegal, so we shouldn't ever encounter them.  So if we did, then something really weird happened.
  * Did a single character get misencoded?
  * Was an illegal sequence illegally encoded?
  * Perhaps a byte got corrupted in transmission?
  * Maybe we dropped a packet/block, so this is really the beginning of a valid sequence and the tail of another completely valid sequence?
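
For example (using the illustrative decode_utf8 sketch above; the byte strings are made up for this mail), these two inputs produce the same three-byte ill-formed run even though one is a single truncated character and the other is two fragments left over from a dropped block:

    # (a) U+1F600 (F0 9F 98 80) that lost its final byte: one truncated character
    decode_utf8(b"\xf0\x9f\x98")   # -> '\ufffd'  (or '\ufffd\ufffd\ufffd' per byte)

    # (b) the first two bytes of U+1F600 followed by the last byte of U+20AC
    #     (E2 82 AC), e.g. after a dropped block: two unrelated fragments
    decode_utf8(b"\xf0\x9f\xac")   # -> '\ufffd'  (or '\ufffd\ufffd\ufffd' per byte)

Either way the decoder can't tell those situations apart; all it knows is that three bytes didn't form a character.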

In practice, all that most apps can do is say "you have bad data; how bad, I have no clue, but it's not right".  A single bit could've flipped, or you could have only 3 pages of a 4000-page document.  No clue at all.  At that point it doesn't really matter how many U+FFFDs the error(s) are replaced with, and no assumptions should be made about the severity of the error.

-Shawn


