Best practices for replacing UTF-8 overlongs

Mon Dec 19 17:52:36 CST 2016

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong
sequences until 2000, but it was never legal to generate them. This was
stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct
use of the instructions and table in RFC 2044 also precluded the
creation of overlong sequences.
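
To make the distinction concrete, here is a minimal Python sketch, not
taken from any of the specifications cited above. It decodes a single
sequence by bit pattern alone, the way a lenient early decoder might
have, and then checks whether the result could have been written in
fewer bytes. The names MIN_FOR_LENGTH, decode_one_lenient and
is_overlong are hypothetical and exist only for this illustration.

```python
# Smallest code point that legitimately needs each UTF-8 length (1..4 bytes).
# Hypothetical table for this illustration.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one_lenient(seq: bytes) -> int:
    """Decode one UTF-8-style sequence by bit pattern only.

    Deliberately performs no shortest-form check, so overlong
    sequences such as C0 80 are accepted.
    """
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif lead & 0xE0 == 0xC0:
        length, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("not a lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        # Each continuation byte contributes its low six bits.
        value = (value << 6) | (b & 0x3F)
    return value


def is_overlong(seq: bytes) -> bool:
    """True if seq uses more bytes than its code point requires."""
    return decode_one_lenient(seq) < MIN_FOR_LENGTH[len(seq)]
```

Under these assumptions, is_overlong(b"\xc0\x80") and
is_overlong(b"\xe0\x80\x80") are both true, since each sequence carries
the value 0, which fits in one byte. A conforming encoder never emits
either form: chr(0).encode("utf-8") yields the single byte 00. Python's
built-in strict decoder likewise rejects both sequences with
UnicodeDecodeError.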

--
Doug Ewell | Thornton, CO, US | ewellic.org