Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at
Tue May 16 05:44:00 CDT 2017

> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable).  However, I’m not entirely certain about things
> like
>   e0 e0 c3 89
> which the proposal would appear to decode as
> instead of a perhaps more reasonable
>   U+FFFD U+FFFD U+00C9         (4)
> (the key part is the “without ever restricting trail bytes to less than
> 80..BF”)

I agree with that as well, because of random access into strings: if you
land on byte 0x89, you can assume it is a trailing byte and will want to
look backward; you will then see 0xc3,0x89, which decodes correctly as
U+00C9 without any error being detected.
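
To make the backward lookup concrete, here is a minimal sketch in Python. The
names is_trail and sequence_start are hypothetical, chosen only for this
illustration, and the sketch assumes that "trailing byte" means any byte in
80..BF and that stepping back at most three bytes is enough:

```python
def is_trail(byte: int) -> bool:
    """True for a UTF-8 trailing (continuation) byte, 0x80..0xBF."""
    return 0x80 <= byte <= 0xBF


def sequence_start(data: bytes, pos: int) -> int:
    """Step back from pos over trailing bytes to a candidate lead byte.

    A well-formed sequence has at most three trailing bytes, so the
    scan never moves back more than three positions.
    """
    start = pos
    steps = 0
    while start > 0 and is_trail(data[start]) and steps < 3:
        start -= 1
        steps += 1
    return start


data = b"\xe0\xe0\xc3\x89"
start = sequence_start(data, 3)            # entering at byte 0x89
text = data[start:start + 2].decode("utf-8")  # strict decode of c3 89
```

Entering at byte 0x89, the scan stops at 0xc3, and the two bytes c3 89 decode
as U+00C9 with no reference to the two 0xe0 bytes before them.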

So the only wrong bytes are the two initial occurrences of 0xe0, each of
which is individually converted to U+FFFD.

In summary: when you detect any ill-formed sequence, replace only the first
code unit with U+FFFD and restart scanning from the next code unit, without
skipping over multiple bytes.
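
A minimal sketch of that rule in Python follows. The function name
decode_per_unit is hypothetical, and the sketch assumes that Python's own
strict "utf-8" codec is an acceptable test of well-formedness; it illustrates
the policy and is not a reference implementation:

```python
REPLACEMENT = "\ufffd"


def decode_per_unit(data: bytes) -> str:
    """Decode UTF-8, replacing only the first code unit of any
    ill-formed sequence and rescanning from the very next code unit."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest well-formed sequence starting at i (1 to 4 bytes).
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += n
            break
        else:
            # Nothing well-formed starts here: emit one U+FFFD and
            # advance by exactly one byte, never skipping further.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)
```

Under this sketch, e0 e0 c3 89 yields U+FFFD U+FFFD U+00C9, and 80 80 80
yields three U+FFFD. Note that e0 e0 e0 also yields three U+FFFD here, not
the single U+FFFD that the proposal suggests for that input.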

This means that emitting multiple occurrences of U+FFFD is not only the best
practice, it also matches the intended design of UTF-8, which allows access
from random positions.