Best practices for replacing UTF-8 overlongs

Karl Williamson public at khwilliamson.com
Mon Dec 19 17:04:06 CST 2016


It seems counterintuitive to me that the two-byte sequence C0 80 should
be replaced by 2 replacement characters under best practices, or that E0
80 80 should also be replaced by 2.  Each sequence was legal in early
Unicode versions, so it seems best to treat each one as a single
sequence and replace it with a single replacement character.
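
For anyone who wants to check what a particular decoder does with these
sequences, here is a minimal sketch.  It assumes Python 3 and its built-in
UTF-8 decoder with errors="replace"; that is only one implementation's
behavior and may not match the best-practice recommendation exactly.  The
helper name count_replacements is made up for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Return how many U+FFFD characters the decoder emits for `data`."""
    # errors="replace" substitutes U+FFFD for each span the decoder rejects.
    decoded = data.decode("utf-8", errors="replace")
    return decoded.count(REPLACEMENT)


# The two overlong sequences discussed in this message.
for seq in (b"\xc0\x80", b"\xe0\x80\x80"):
    print(seq.hex(" ").upper(), "->", count_replacements(seq), "replacement character(s)")
```

Running this against other decoders' equivalents would show whether they
replace per byte or per sequence.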

What are the advantages of replacing them with multiple characters?

