Best practices for replacing UTF-8 overlongs
Karl Williamson
public at khwilliamson.com
Mon Dec 19 17:04:06 CST 2016
It seems counterintuitive to me that, under best practices, the two-byte
sequence C0 80 should be replaced by two replacement characters, or that
the three-byte sequence E0 80 80 should be replaced by three. Each
sequence was legal in early Unicode versions, so it seems best to treat
each as a single sequence and replace it with a single replacement
character. What are the advantages of replacing them with multiple
characters?
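
To make the two behaviors concrete, here is a minimal Python sketch. It
assumes CPython 3's built-in decoder as a stand-in for the best-practice
behavior. The helper names count_replacements and structural_replace are
hypothetical, and structural_replace is only one possible reading of
"treat each as a single sequence", not anything defined by the Standard.

```python
REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Count the U+FFFD characters produced by Python's built-in decoder."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


def structural_replace(data: bytes) -> str:
    """Hypothetical alternative: group a lead byte with the continuation
    bytes its bit pattern asks for, and emit a single U+FFFD when the
    resulting group is not well-formed UTF-8."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if 0xC0 <= b <= 0xDF:
            need = 1          # lead byte of a two-byte pattern
        elif 0xE0 <= b <= 0xEF:
            need = 2          # lead byte of a three-byte pattern
        elif 0xF0 <= b <= 0xF7:
            need = 3          # lead byte of a four-byte pattern
        else:
            need = 0          # ASCII, stray continuation byte, or F8..FF
        j = i + 1
        # Absorb up to `need` continuation bytes (80..BF) after the lead.
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))   # strict: rejects overlongs
        except UnicodeDecodeError:
            out.append(REPLACEMENT)             # one U+FFFD per group
        i = j
    return "".join(out)


if __name__ == "__main__":
    for sample in (b"\xc0\x80", b"\xe0\x80\x80"):
        print(sample.hex(" "), count_replacements(sample),
              structural_replace(sample).count(REPLACEMENT))
```

With CPython 3, I would expect the built-in decoder to report two
replacement characters for C0 80 and three for E0 80 80, while the
hypothetical grouping decoder reports one for each.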