Best practices for replacing UTF-8 overlongs

Doug Ewell doug at ewellic.org
Mon Dec 19 17:52:36 CST 2016


Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong
sequences until 2000, but it was never legal to generate them. This was
stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct
use of the instructions and table in RFC 2044 also precluded the
creation of overlong sequences. 
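
To illustrate why following the encoding table cannot yield an overlong
sequence, here is a minimal Python sketch. The helper names
(shortest_length, encode_utf8, is_overlong) are hypothetical, and the
sketch assumes the modern four-row table ending at U+10FFFF rather than
the six-row table of RFC 2044. It does not reject surrogates and is not a
full validator. Because each code point falls in exactly one row, the
encoder always picks the shortest form; the checker flags any
well-formed-looking sequence whose payload would have fit in fewer bytes.

```python
# Hypothetical helpers, assuming the modern four-row UTF-8 table.
# Each row: (largest code point in the row, number of bytes used).
_ROWS = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3), (0x10FFFF, 4)]
_LEAD = {2: 0xC0, 3: 0xE0, 4: 0xF0}   # marker bits of the lead byte
_MASK = {2: 0x1F, 3: 0x0F, 4: 0x07}   # payload bits of the lead byte


def shortest_length(cp):
    """Return the byte count of the single table row that contains cp."""
    if cp < 0:
        raise ValueError("negative code point")
    for upper, length in _ROWS:
        if cp <= upper:
            return length
    raise ValueError("code point above U+10FFFF")


def encode_utf8(cp):
    """Encode cp using the row chosen by shortest_length (never overlong)."""
    n = shortest_length(cp)
    if n == 1:
        return bytes([cp])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))  # low six bits per trailing byte
        cp >>= 6
    return bytes([_LEAD[n] | cp] + tail[::-1])


def is_overlong(seq):
    """True if seq has the shape of one 2- to 4-byte sequence but its
    payload would have fit in a shorter row of the table."""
    n = len(seq)
    if n == 1:
        return False
    if n not in _LEAD or (seq[0] & ~_MASK[n] & 0xFF) != _LEAD[n]:
        raise ValueError("not a 2- to 4-byte lead byte pattern")
    cp = seq[0] & _MASK[n]
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("bad trailing byte")
        cp = (cp << 6) | (b & 0x3F)
    return shortest_length(cp) < n
```

With these definitions, is_overlong(b"\xc0\x80") and
is_overlong(b"\xe0\x80\x80") are both true (each carries the payload
U+0000, which belongs to the one-byte row), while encode_utf8(0) returns
the single byte 00.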
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


