Best practices for replacing UTF-8 overlongs

Mon Dec 19 20:56:58 CST 2016

On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson <public at khwilliamson.com>
wrote:

> It seems counterintuitive to me that the two byte sequence C0 80 should be
> replaced by 2 replacement characters under best practices, or that E0 80 80
> should also be replaced by 2.  Each sequence was legal in early Unicode
> versions, and it seems that it would be best to treat them as each a single
> sequence, replacing by a single replacement character.
>
> What are the advantages to replacing them by multiple characters?
>

C0 80 is about the only exception, because of the prevalent use of '\0' as
the end-of-string marker.
I tend not to generate it unless I am converting from wchar_t to UTF-8 and
the length exceeds the characters.
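
A minimal sketch of that one exception, assuming the convention used by
"modified UTF-8" (for example in Java), where U+0000 is written as C0 80 so
that the output never contains a '\0' byte. The function names are made up
for illustration; this is not any particular library's API.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    In well-formed UTF-8 the byte 0x00 only ever encodes U+0000, so a
    plain byte replacement is enough here.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse of encode_nul_safe: map C0 80 back to 0x00, then decode.

    The byte 0xC0 never appears in well-formed UTF-8, so the pair C0 80
    cannot be confused with anything else.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    raw = encode_nul_safe("a\x00b")
    print(raw)                   # no 0x00 byte in the output
    print(decode_nul_safe(raw) == "a\x00b")
```

The result is only safe to hand to code that expects this convention; a
strict UTF-8 decoder will reject the C0 80 pair.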

Most things will die badly when fed 'overlong' characters, because
everything should be represented with the fewest possible bytes (0-0x7f is
just 1 byte each, but C0 80 is not necessarily 0).

Apart from that one case, overlong forms are illegal in most places that
implement code point conversions.
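
As a rough illustration of the "fewest possible bytes" rule, the sketch
below pulls the payload bits out of a single sequence without any
shortest-form check, then compares the sequence length with the minimum
length for that value. The helper names are hypothetical, and the decoder
is deliberately naive: it handles only 1 to 4 byte forms and does not check
for surrogates or values above U+10FFFF.

```python
def naive_decode(seq: bytes) -> int:
    """Extract the payload bits of one UTF-8-style sequence.

    No shortest-form check is done, so overlong input is accepted here.
    """
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif lead & 0xE0 == 0xC0:
        length, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("not a lead byte")
    if len(seq) != length:
        raise ValueError("sequence length does not match lead byte")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def min_length(value: int) -> int:
    """Number of bytes the shortest UTF-8 form of this value needs."""
    if value < 0x80:
        return 1
    if value < 0x800:
        return 2
    if value < 0x10000:
        return 3
    return 4


def is_overlong(seq: bytes) -> bool:
    """True if the sequence uses more bytes than its value requires."""
    return len(seq) > min_length(naive_decode(seq))


if __name__ == "__main__":
    for sample in (b"\xc0\x80", b"\xe0\x80\x80", b"\xc3\xa9", b"A"):
        print(sample, is_overlong(sample))
```

Python's own strict decoder refuses C0 80 outright, so a check like this is
only needed when writing a decoder by hand.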

There were many 'legal' definitions that will simply never be used, because
there is a finite number of characters and all of them fit in 21 bits (code
points stop at U+10FFFF).
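
To put numbers on that, here is a small sketch comparing how many payload
bits each sequence length can carry in the original 1 to 6 byte design with
how many bits the largest code point, U+10FFFF, actually needs. The
function name is made up for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point in the Unicode code space


def payload_bits(length: int) -> int:
    """Payload bits in an n-byte sequence of the original 1..6 byte design.

    A 1-byte form carries 7 bits. Longer forms carry (7 - n) bits in the
    lead byte plus 6 bits in each of the n - 1 continuation bytes.
    """
    if not 1 <= length <= 6:
        raise ValueError("length must be between 1 and 6")
    if length == 1:
        return 7
    return (7 - length) + 6 * (length - 1)


if __name__ == "__main__":
    print("bits needed:", MAX_CODE_POINT.bit_length())
    for n in range(1, 7):
        print(n, "bytes carry", payload_bits(n), "bits")
```

Four bytes already cover the whole code space, which is why the 5 and 6
byte forms have no use.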