Best practices for replacing UTF-8 overlongs
richard.wordingham at ntlworld.com
Mon Dec 19 18:43:15 CST 2016
On Mon, 19 Dec 2016 16:04:06 -0700
Karl Williamson <public at khwilliamson.com> wrote:
> What are the advantages to replacing them by multiple characters
Presumably it just provides more pain for those who code using UTF-8 as
opposed to UTF-16, just like the *former* requirements to be able to
search for lone surrogates (Unicode Regular Expressions RL1.7)
or give lone surrogates a specific position in DUCET collation (UCA
Conformance test - automatic test failure if working in UTF-8!). Moving
one 'character' backwards through a purported UTF-8 string gets so much
more interesting when one backs into E0 80 BF.
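To illustrate the backward-stepping point, here is a minimal Python sketch.
The helper naive_step_back is a hypothetical name, not taken from any
library; it simply skips back over continuation bytes (10xxxxxx) until it
reaches a byte that is not one. CPython's own decoder with errors="replace"
is used only as one example of a forward decoder that replaces E0 80 BF by
three U+FFFD.

```python
def naive_step_back(buf: bytes, pos: int) -> int:
    """Step back from pos to the nearest byte that is not a continuation byte.

    Hypothetical helper for illustration only: it assumes well-formed input
    and so treats any run of continuation bytes as belonging to the lead byte
    in front of it.
    """
    pos -= 1
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = b"a\xe0\x80\xbf"

# Forward decoding in CPython: E0 cannot be followed by 80, so E0, 80 and BF
# are each replaced separately.
forward = data.decode("utf-8", errors="replace")

# Backward stepping from the end lands on E0, treating all three bytes as
# one unit, which disagrees with the forward count.
back = naive_step_back(data, len(data))

print(len(forward) - 1, "replacement characters going forward")
print(len(data) - back, "bytes consumed by one naive backward step")
```

To agree with a forward decoder of that kind, the backward step would have
to stop after BF alone, which means checking whether the bytes it has just
skipped could actually follow the lead byte it found.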
It also makes it harder to bend UTF-8 to allow U+0000 in C strings.
One trick for making essentially UTF-8 programs non-compliant is to have
test strings with embedded nulls. One solution that has been used is to
allow C0 80 to represent U+0000 in a null-terminated string.
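A rough Python sketch of that C0 80 convention follows. The function names
encode_nul_safe and decode_nul_safe are made up for this example, and the
sketch assumes the input bytes are otherwise well-formed UTF-8, so that C0 80
can only appear where the encoder put it.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    Hypothetical helper: the result contains no 00 byte, so it can be stored
    in a null-terminated C string.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe: map C0 80 back to 00, then decode strictly."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "ab\x00cd"
encoded = encode_nul_safe(sample)
print(encoded)                       # no 00 byte in the output
print(decode_nul_safe(encoded) == sample)
```

A strict UTF-8 decoder rejects the encoded form, since C0 80 is an overlong
sequence; that is exactly what makes such a program non-compliant.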
Of course, this problem goes away if C0 is used to introduce
replacements for the formerly useful non-characters. :-)
Of course, there is the issue of what to do with F8 80 81 82 83.
Replace by one character as once legal, or by two as no character can
be represented by more than four bytes?
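For comparison, the Python sketch below counts replacements for F8 80 81 82
83 under two policies. The functions legacy_length and count_units_legacy are
hypothetical and implement only the "one character as once legal" reading,
using the original 1-to-6-byte lead-byte layout. CPython's decoder is shown
as a third data point, not as an answer to the question: it treats F8 as an
invalid byte and each following continuation byte as a separate error.

```python
def legacy_length(lead: int) -> int:
    """Sequence length implied by a lead byte in the original 1-to-6-byte layout."""
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return 1  # stray continuation byte, counted as a unit on its own
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5
    if lead < 0xFE:
        return 6
    return 1  # FE and FF were never lead bytes


def count_units_legacy(data: bytes) -> int:
    """Count units when each structurally complete legacy sequence is one unit."""
    i = 0
    units = 0
    while i < len(data):
        want = legacy_length(data[i])
        j = i + 1
        # Consume at most want - 1 continuation bytes after the lead byte.
        while j < i + want and j < len(data) and (data[j] & 0xC0) == 0x80:
            j += 1
        units += 1
        i = j
    return units


data = b"\xf8\x80\x81\x82\x83"
print(count_units_legacy(data), "unit under the legacy five-byte reading")
print(len(data.decode("utf-8", errors="replace")), "U+FFFD from CPython's decoder")
```

The "two characters" reading mentioned above (four bytes, then one) is not
implemented here; it would cap the length at four in legacy_length.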