Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Doug Ewell via Unicode unicode at
Wed May 17 15:37:51 CDT 2017

Richard Wordingham wrote:

>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
> It was once a legal way of encoding NUL, just like C0 80, which is
> still in use, and seems to be the best way of storing NUL as character
> content in a *C string*.

I wish I had a penny for every time I'd seen this urban legend.

At you can read the
original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
ago that it was still called FSS-UTF:

"When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal."
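
To make "multiple ways to encode a value" concrete, here is a minimal
Python sketch. The helper names encode_with_length and shortest_length
are hypothetical, chosen only for illustration; the first one
deliberately ignores the shortest-form rule so that the redundant
encodings of UCS 0 can be seen side by side.

```python
def shortest_length(cp: int) -> int:
    """Number of bytes in the only legal (shortest) UTF-8 form of cp."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def encode_with_length(cp: int, n: int) -> bytes:
    """Pack cp into an n-byte UTF-8 bit pattern (n in 1..4).

    Illustration only: this does NOT enforce the shortest-form rule,
    so it can produce sequences that a conformant encoder must never emit.
    """
    if n == 1:
        if cp > 0x7F:
            raise ValueError("value does not fit in 1 byte")
        return bytes([cp])
    # Payload capacity: (7 - n) bits in the lead byte, 6 per trailing byte.
    payload_bits = (7 - n) + 6 * (n - 1)
    if cp >= 1 << payload_bits:
        raise ValueError(f"value does not fit in {n} bytes")
    lead_marker = (0xFF << (8 - n)) & 0xFF  # 0xC0, 0xE0, 0xF0 for n = 2, 3, 4
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))  # trailing byte: 10xxxxxx
        cp >>= 6
    out.append(lead_marker | cp)        # lead byte carries the top bits
    return bytes(reversed(out))


if __name__ == "__main__":
    for n in (1, 2, 3):
        seq = encode_with_length(0, n)
        legal = (n == shortest_length(0))
        print(seq.hex(" "), "legal" if legal else "non-shortest, not legal")
```

For UCS 0 the three patterns are 00, C0 80 and E0 80 80; under the
rule quoted above, only the one-byte form is legal.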

Unicode once permitted implementations to *decode* non-shortest forms,
but never allowed an implementation to *create* them:

"For example, UTF-8 allows nonshortest code value sequences to be
interpreted: a UTF-8 conformant process may map the code value sequence C0 80
(11000000₂ 10000000₂) to the Unicode value U+0000, even though a
UTF-8 conformant process shall never generate that code value sequence
-- it shall generate the sequence 00 (00000000₂) instead."

This was the passage that was deleted as part of Corrigendum #1.
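
The difference between the old permission to decode and the current
rule can be sketched in Python. decode_lenient is a hypothetical
function written for this illustration, not any library's API; it
accepts non-shortest forms the way the deleted passage once allowed.
The strict side uses Python's built-in UTF-8 codec, which rejects them.

```python
def decode_lenient(data: bytes) -> int:
    """Decode a single UTF-8-style sequence, accepting non-shortest forms.

    Illustration only: a conformant decoder today must reject these.
    """
    lead = data[0]
    if lead < 0x80:
        n, cp = 1, lead
    elif lead & 0xE0 == 0xC0:
        n, cp = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        n, cp = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        n, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != n:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trailing byte")
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits
    return cp


def strict_rejects(data: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    for seq in (b"\x00", b"\xc0\x80", b"\xe0\x80\x80"):
        print(seq.hex(" "),
              "lenient ->", f"U+{decode_lenient(seq):04X}",
              "| strict rejects:", strict_rejects(seq))
```

All three inputs map to U+0000 in the lenient sketch, while the strict
decoder accepts only 00.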

Doug Ewell | Thornton, CO, US |