Best practices for replacing UTF-8 overlongs

Tue Dec 20 00:56:53 CST 2016

On Mon, 19 Dec 2016 20:54:31 -0700
Doug Ewell <doug at ewellic.org> wrote:

> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in assessing how much of the document is suspect.

How many bytes are invalid in the sequence F0 30 A0 B0?  There might
just be one bit error in the data stream: had the second byte been B0
rather than 30, the sequence would be well-formed.

The chief advantage of collapsing is the simplicity of the
decoding logic.  The natural logic is to read the requisite number of
continuation bytes, convert the whole to a codepoint value, and then
check that the value is allowed for a sequence of that length in UTF-8
(not overlong, not a surrogate, not above U+10FFFF).  Obviously one also
has to check that the requisite continuation bytes are present.
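
As a rough illustration of that logic, here is a minimal Python sketch.
The names decode_collapsing and MIN_FOR_LENGTH are made up for this
example, and the handling of a missing continuation byte (one
replacement character for the bytes consumed so far, then resume at the
offending byte) is an assumption, not something settled in this thread.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs 2, 3, or 4 bytes;
# anything below this in a sequence of that length is an overlong.
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000}


def decode_collapsing(data: bytes) -> str:
    """Decode UTF-8, collapsing each bad sequence to one U+FFFD (sketch)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        i += 1
        if lead < 0x80:
            out.append(chr(lead))
            continue
        # Work out the sequence length from the lead byte alone.
        if 0xC0 <= lead <= 0xDF:
            length, cp = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, cp = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, cp = 4, lead & 0x07
        else:
            # Stray continuation byte (80..BF) or F8..FF.
            out.append(REPLACEMENT)
            continue
        # Read the requisite continuation bytes, building the value.
        complete = True
        for _ in range(length - 1):
            if i < n and 0x80 <= data[i] <= 0xBF:
                cp = (cp << 6) | (data[i] & 0x3F)
                i += 1
            else:
                complete = False  # assumption: resume at this byte
                break
        # Only now check whether the assembled value is allowed.
        if (not complete
                or cp < MIN_FOR_LENGTH[length]   # overlong
                or 0xD800 <= cp <= 0xDFFF        # surrogate
                or cp > 0x10FFFF):               # out of range
            out.append(REPLACEMENT)
        else:
            out.append(chr(cp))
    return "".join(out)
```

With this sketch the overlong C0 80 comes out as a single U+FFFD,
whereas replacing each maximal subpart would give two.  The sequence
F0 30 A0 B0 comes out as U+FFFD, "0", U+FFFD, U+FFFD under the
assumption stated above.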

Arguments then come down to the use or otherwise of library functions
and the number of error-reporting mechanisms to be used.

Richard.