Best practices for replacing UTF-8 overlongs

Ken Whistler kenwhistler at
Tue Dec 20 10:59:11 CST 2016


On 12/19/2016 6:08 PM, Doug Ewell wrote:
> I thought there was a corrigendum or other, comparatively recent 
> addition to the Standard that spelled out how replacement characters 
> are supposed to be substituted for invalid code unit sequences -- 
> something about detecting maximally long sequences. I'll look when I 
> have a chance.

You found the resulting text in TUS 9.0, pp. 126-129. The origin of the 
text there about best practices for using U+FFFD was the discussion and 
resolution of PRI #121 in August 2008:

That was discussed at UTC #116. See the minutes:

There was feedback at the time advocating the 3rd option, rather than 
the 2nd one that was eventually chosen by the UTC. See:

The actual text that resulted was first published in Unicode 5.2, p. 95:

Contrast that with the text in Unicode 5.0, which had no extended 
discussion about handling conversion errors there. The Unicode 5.2 text 
was later expanded with more definitions and explanation, to what you 
see now in Unicode 9.0.
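
As a rough illustration of what that best practice means for overlongs 
and truncated sequences, here is a minimal Python sketch. It assumes 
CPython 3's built-in UTF-8 decoder with errors="replace", which to my 
knowledge emits one U+FFFD per maximal subpart of an ill-formed 
subsequence; the helper name decode_with_replacement and the sample byte 
strings are my own choices for the example, not anything taken from the 
PRI or the UTC minutes.

```python
# Illustrative sketch only: relies on CPython 3's built-in decoder, which is
# assumed here to follow the "maximal subpart" replacement practice.

def decode_with_replacement(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for ill-formed subsequences."""
    return data.decode("utf-8", errors="replace")


# Overlong two-byte encoding of "/" (U+002F). 0xC0 can never start a
# well-formed sequence, so each byte is replaced on its own.
overlong_slash = decode_with_replacement(bytes([0xC0, 0xAF]))

# Overlong three-byte encoding of "/". After 0xE0 the next byte must be in
# 0xA0..0xBF, so 0xE0, 0x80 and 0xAF are each replaced separately.
overlong_slash_3 = decode_with_replacement(bytes([0xE0, 0x80, 0xAF]))

# Mixed sample: truncated but otherwise valid prefixes (F1 80 80, E1 80, C2)
# each collapse to a single U+FFFD; stray trail bytes get one each.
mixed = decode_with_replacement(
    bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")
)

for label, text in [
    ("C0 AF", overlong_slash),
    ("E0 80 AF", overlong_slash_3),
    ("mixed", mixed),
]:
    # Show the resulting code points, e.g. U+FFFD U+FFFD
    print(label, "->", " ".join(f"U+{ord(ch):04X}" for ch in text))
```

The point of the sketch is that an overlong form never decodes to the 
character it pretends to encode, and that the number of U+FFFD produced 
depends on how far a valid prefix extends, not on the length implied by 
the lead byte.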

