Best practices for replacing UTF-8 overlongs

Ken Whistler kenwhistler at att.net
Tue Dec 20 10:59:11 CST 2016


Doug,


On 12/19/2016 6:08 PM, Doug Ewell wrote:
> I thought there was a corrigendum or other, comparatively recent 
> addition to the Standard that spelled out how replacement characters 
> are supposed to be substituted for invalid code unit sequences -- 
> something about detecting maximally long sequences. I'll look when I 
> have a chance.
>
You found the resulting text in TUS 9.0, pp. 126-129. The text there 
about best practices for using U+FFFD originated in the discussion and 
resolution of PRI #121 in August 2008:

http://www.unicode.org/review/pr-121.html

That was discussed at UTC #116. See the minutes:

http://www.unicode.org/L2/L2008/08253.htm

There was feedback at the time advocating the 3rd option, rather than 
the 2nd one that was eventually chosen by the UTC. See:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

The actual text that resulted was first published in Unicode 5.2, p. 95:

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf

Contrast that with the text in Unicode 5.0, which had no extended 
discussion about handling conversion errors there. The Unicode 5.2 text 
was later expanded with more definitions and explanation, to what you 
see now in Unicode 9.0.
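
The practice described there (one U+FFFD per maximal subpart of an 
ill-formed subsequence) can be sketched roughly as follows in Python. 
This is a minimal sketch under stated assumptions, not the Standard's 
own wording: the names `decode_with_replacement` and `_lead_info` are 
hypothetical, and the byte ranges are taken to follow the table of 
well-formed UTF-8 byte sequences in Chapter 3.

```python
REPLACEMENT = "\ufffd"

def _lead_info(lead):
    """Return (sequence length, lo, hi) where lo..hi is the allowed range
    for the second byte, or None if `lead` cannot start a sequence.
    Ranges are assumed to match the well-formed UTF-8 table."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF   # excludes 3-byte overlongs
    if lead == 0xED:
        return 3, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF   # excludes 4-byte overlongs
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F   # excludes values above U+10FFFF
    return None                # 80..C1 and F5..FF never start a sequence

def decode_with_replacement(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per maximal subpart
    of each ill-formed subsequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        info = _lead_info(b)
        if info is None:
            # A byte that cannot begin a sequence is replaced on its own.
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = info
        j = i + 1
        ok = True
        for k in range(1, length):
            low, high = (lo, hi) if k == 1 else (0x80, 0xBF)
            if j < n and low <= data[j] <= high:
                j += 1
            else:
                ok = False
                break
        # On failure, data[i:j] is the maximal subpart: one U+FFFD for it,
        # and the offending byte at j is examined again as a new start.
        out.append(data[i:j].decode("utf-8") if ok else REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch, the overlong pair C0 AF yields two U+FFFD (C0 can 
never start a well-formed sequence, so each byte is replaced on its 
own), while a truncated sequence such as F1 80 80 followed by an 
unrelated byte yields a single U+FFFD.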

--Ken
