Best practices for replacing UTF-8 overlongs

Mon Dec 19 19:23:25 CST 2016

If a short sequence of invalid bytes is presumed to be one character, then one versus several replacement characters may not matter. But if it is a longer sequence that might have been several invalidly coded characters, then multiple replacement characters give a more accurate picture of how much information was removed or miscoded.

There isn't much to be gained by collapsing the bad bytes to a single replacement character. Doing so does, however, discard the information about how many bytes were invalid, and that may have value to a user in assessing how much of the document is suspect.
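
To make the trade-off concrete, here is a minimal Python sketch. The name decode_lossy and its collapse flag are hypothetical, invented for this illustration; they are not part of any library. The sketch assumes one particular policy, namely that every byte which cannot start a well-formed sequence is an error of its own, and that is only one of several policies a real decoder might choose.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def decode_lossy(data: bytes, collapse: bool = False) -> str:
    """Hypothetical helper: decode UTF-8, replacing each bad byte.

    With collapse=True, a run of consecutive bad bytes yields a single
    U+FFFD, which hides how many bytes were invalid.
    """
    out = []
    i = 0
    in_bad_run = False
    while i < len(data):
        # Look for a well-formed sequence of 1 to 4 bytes starting at i.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += n
            in_bad_run = False
            break
        else:
            # No well-formed sequence starts here, so byte i is bad.
            if not (collapse and in_bad_run):
                out.append(REPLACEMENT)
            in_bad_run = True
            i += 1
    return "".join(out)

damaged = b"ab\xc0\x80\xe0\x80\x80cd"
print(decode_lossy(damaged).count(REPLACEMENT))                 # 5
print(decode_lossy(damaged, collapse=True).count(REPLACEMENT))  # 1
```

In the uncollapsed output the five replacement characters tell the reader that five bytes were lost; the collapsed output gives no hint whether one byte or fifty went missing.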

tex

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs

IMO, the first byte of the 2-byte sequence is illegal. So replace it with a single replacement character (hey, I ran into something unintelligible), and move on. Then you encounter a trail byte without a lead byte, so again, it's an illegal byte, so replace it with a single replacement character.

So you end up with two.
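
The byte-by-byte reading above can be sketched in Python as follows. The function names classify and walk are made up for this illustration, and the classification looks only at the role a byte can play on its own; it does not check the narrower second-byte ranges that apply after some lead bytes.

```python
def classify(byte: int) -> str:
    """Classify one byte by the role it can play in well-formed UTF-8."""
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "trail"        # 80..BF: only valid after a lead byte
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "never valid"  # cannot appear anywhere in well-formed UTF-8
    return "lead"             # C2..F4: may start a multi-byte sequence

def walk(data: bytes) -> list:
    """Pair each byte (as hex) with its classification."""
    return [(f"{b:02X}", classify(b)) for b in data]

print(walk(b"\xc0\x80"))
# [('C0', 'never valid'), ('80', 'trail')]
```

Since C0 can never begin a sequence, the 80 that follows it has no lead byte to attach to, and each of the two bytes is an error by itself.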

Replacing the sequence with a single replacement character implies some perceived understanding of an intended structure that actually doesn't exist.

I'm curious though what the practical difference would be?  If I encountered junk like that in the middle of a string, my string is going to be disrupted by an unexpected replacement character.  At that point it's already mangled, so does it really matter if there're two instead of one?

-Shawn

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, December 19, 2016 3:53 PM
To: Unicode Mailing List <unicode at unicode.org>
Cc: Karl Williamson <public at khwilliamson.com>
Subject: Re: Best practices for replacing UTF-8 overlongs

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 
> should be replaced by 2 replacement characters under best practices, 
> or that E0 80 80 should also be replaced by 2. Each sequence was legal 
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong sequences until 2000, but it was never legal to generate them. This was stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct use of the instructions and table in RFC 2044 also precluded the creation of overlong sequences. 
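
As an illustration of what "overlong" means here, the following Python sketch combines the payload bits the way a lax decoder might and compares the sequence length with the shortest form for the resulting code point. The helper names (shortest_length, naive_decode, is_overlong) are hypothetical, and naive_decode assumes its input is exactly one lead byte followed by the matching number of trail bytes; it does no validation of its own.

```python
def shortest_length(cp: int) -> int:
    """Number of bytes in the shortest UTF-8 form of a code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def naive_decode(seq: bytes) -> int:
    """Combine payload bits with no shortest-form check (a lax decoder)."""
    n = len(seq)
    if n == 1:
        return seq[0]
    cp = seq[0] & (0xFF >> (n + 1))  # payload bits of the lead byte
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)  # six payload bits per trail byte
    return cp

def is_overlong(seq: bytes) -> bool:
    """True if seq uses more bytes than its code point requires."""
    return len(seq) > shortest_length(naive_decode(seq))

print(is_overlong(b"\xc0\x80"))      # True: U+0000 needs only 1 byte
print(is_overlong(b"\xe0\x80\x80"))  # True: U+0000 again
print(is_overlong(b"\xe2\x82\xac"))  # False: U+20AC needs 3 bytes
print("\x00".encode("utf-8"))        # b'\x00'
```

The last line shows the generating side: a conforming encoder emits the one-byte form for U+0000, so neither C0 80 nor E0 80 80 ever comes out of it.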

--
Doug Ewell | Thornton, CO, US | ewellic.org