Corrigendum #9

David Starner prosfilaes at
Wed Jul 2 13:19:32 CDT 2014

On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson <public at> wrote:
> In
> UTF-8, an example would be that Sun, I'm told, and for reasons I've
> forgotten or never knew, did not want raw NUL bytes to appear in text
> streams, so used the overlong sequence \xC0\x80 to represent them; overlong
> sequences generally being considered "bad" because they could be used to
> insert malicious payloads into the input.

In C, NUL ends a string. If you have to run data that may have NUL
characters through C functions, you can't store the NULs as \0. I
might argue 11111111b for 0x00 in UTF-8 would be technically
legal--the standard never specifies which bit sequences correspond to
which byte values--but \xC0\x80 would probably be more reliably
processed by existing code.
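
To make the point concrete, here is a minimal sketch in C of the substitution being described. The helper name encode_nul_overlong is hypothetical, not taken from any library or from Sun's code; it only assumes that each 0x00 byte is written as the two bytes 0xC0 0x80, so the result contains no raw NUL before its terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an illustration, not an existing API): copy n bytes
 * from src to dst, writing every 0x00 byte as the overlong pair 0xC0 0x80.
 * dst must have room for 2 * n + 1 bytes. The output is NUL-terminated, and
 * the return value is the encoded length without the terminator. */
size_t encode_nul_overlong(const unsigned char *src, size_t n,
                           unsigned char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < n; i++) {
        if (src[i] == 0x00) {
            /* Overlong two-byte form of U+0000. */
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';  /* the only raw NUL in the output is the terminator */
    return out;
}
```

For the five input bytes 'a', 'b', 0x00, 'c', 'd', strlen() on the raw buffer stops at 2, while strlen() on the encoded buffer reports all 6 bytes, so the data survives functions that treat NUL as end-of-string. The cost is that a strict UTF-8 decoder will reject \xC0\x80 as an overlong sequence, so the receiving side has to know about the convention.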

Where there is life, there is hope.
