Corrigendum #9

David Starner prosfilaes at gmail.com
Wed Jul 2 13:19:32 CDT 2014


On Wed, Jul 2, 2014 at 8:02 AM, Karl Williamson <public at khwilliamson.com> wrote:
> In
> UTF-8, an example would be that Sun, I'm told, and for reasons I've
> forgotten or never knew, did not want raw NUL bytes to appear in text
> streams, so used the overlong sequence \xC0\x80 to represent them; overlong
> sequences generally being considered "bad" because they could be used to
> insert malicious payloads into the input.

In C, a NUL byte terminates a string. If you have to pass data that may
contain NUL characters through C string functions, you can't store those
NULs as \0. I might argue that 11111111b for 0x00 in UTF-8 would be
technically legal--the standard never specifies which bit sequences
correspond to which byte values--but \xC0\x80 is more likely to be
processed reliably by existing code.
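
To make the truncation problem concrete, here is a minimal sketch in Python. The function names are made up for illustration, and ctypes.create_string_buffer(...).value stands in for any C function that reads up to the first NUL byte. It assumes the input is otherwise ordinary UTF-8, in which the byte 0xC0 never occurs, so the substitution can be reversed safely.

```python
import ctypes

# Two-byte overlong form of U+0000 (not valid in standard UTF-8).
OVERLONG_NUL = b"\xc0\x80"


def encode_nul_safe(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the overlong pair C0 80."""
    return text.encode("utf-8").replace(b"\x00", OVERLONG_NUL)


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe: map C0 80 back to a NUL byte, then decode."""
    return data.replace(OVERLONG_NUL, b"\x00").decode("utf-8")


def as_c_string(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string reader would see."""
    return ctypes.create_string_buffer(data).value


if __name__ == "__main__":
    text = "ab\x00cd"
    print(as_c_string(text.encode("utf-8")))   # stops at the raw NUL
    print(as_c_string(encode_nul_safe(text)))  # survives intact
```

The trade-off is that the encoded bytes are no longer well-formed UTF-8: a strict decoder (Python's own bytes.decode("utf-8"), for instance) rejects the C0 80 pair, so both ends have to agree on the convention.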

-- 
Kie ekzistas vivo, ekzistas espero.

