Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Doug Ewell via Unicode
unicode at unicode.org
Wed May 17 15:37:51 CDT 2017
Richard Wordingham wrote:
>> It is not at all clear what the intent of the encoder was - or even
>> if it's not just a problem with the data stream. E0 80 80 is not
>> permitted, it's garbage. An encoder can't "intend" it.
> It was once a legal way of encoding NUL, just like C0 80, which is
> still in use, and seems to be the best way of storing NUL as character
> content in a *C string*.
I wish I had a penny for every time I'd seen this urban legend.
At http://doc.cat-v.org/bell_labs/utf-8_history you can read the
original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
ago that it was still called FSS-UTF:
"When there are multiple ways to encode a value, for example
UCS 0, only the shortest encoding is legal."
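To make the "multiple ways to encode a value" point concrete, here is a minimal Python sketch. The helper names (encode_fixed_length, shortest_length, is_shortest_form) are hypothetical and appear in no specification, and the sketch stops at 4-byte forms even though FSS-UTF originally defined longer ones. It deliberately builds the non-shortest forms purely for illustration; a conformant encoder must never emit them.

```python
def encode_fixed_length(cp: int, n: int) -> bytes:
    """Encode code point cp in exactly n bytes (1..4), even if that is overlong.

    Hypothetical illustration only: a conformant encoder never does this.
    """
    if n == 1:
        if cp > 0x7F:
            raise ValueError("value does not fit in 1 byte")
        return bytes([cp])
    payload_bits = {2: 11, 3: 16, 4: 21}[n]
    if cp >= (1 << payload_bits):
        raise ValueError(f"value does not fit in {n} bytes")
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]
    out = []
    # Peel off 6 payload bits per continuation byte, lowest bits first.
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever remains goes into the lead byte.
    out.append(lead_marker | cp)
    return bytes(reversed(out))


def shortest_length(cp: int) -> int:
    """Number of bytes in the shortest (only legal) encoding of cp."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def is_shortest_form(cp: int, encoded: bytes) -> bool:
    """True only when encoded uses the minimum number of bytes for cp."""
    return len(encoded) == shortest_length(cp)


for n in (1, 2, 3):
    seq = encode_fixed_length(0, n)
    print(seq.hex(" ").upper(), "legal" if is_shortest_form(0, seq) else "not legal")
```

All three sequences carry the same payload bits, but only the one-byte form passes the shortest-form test.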
Unicode once permitted implementations to *decode* non-shortest forms,
but never allowed an implementation to *create* them:
"For example, UTF-8 allows nonshortest code value sequences to be
interpreted: a UTF-8 conformant process may map the code value sequence C0 80
(11000000₂ 10000000₂) to the Unicode value U+0000, even though a
UTF-8 conformant process shall never generate that code value sequence
-- it shall generate the sequence 00 (00000000₂) instead."
This was the passage that was deleted as part of Corrigendum #1.
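The difference between the old permission and current behavior can be sketched in Python. The function decode_one_lenient is a hypothetical illustration of a decoder that skips the shortest-form check, handling only 1- to 3-byte sequences; it is not taken from any real library. strict_accepts simply wraps Python's built-in UTF-8 codec, which rejects non-shortest forms.

```python
def decode_one_lenient(seq: bytes) -> int:
    """Decode a single 1- to 3-byte sequence WITHOUT a shortest-form check.

    Hypothetical illustration of the formerly permitted interpretation.
    """
    lead = seq[0]
    if lead < 0x80:
        n, value = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        n, value = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        n, value = 3, lead & 0x0F
    else:
        raise ValueError("unsupported lead byte in this sketch")
    if len(seq) != n:
        raise ValueError("sequence length does not match lead byte")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        # Append the 6 payload bits of each continuation byte.
        value = (value << 6) | (b & 0x3F)
    return value


def strict_accepts(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for seq in (b"\x00", b"\xc0\x80", b"\xe0\x80\x80"):
    print(seq.hex(" ").upper(),
          "lenient -> U+%04X" % decode_one_lenient(seq),
          "| strict accepts:", strict_accepts(seq))
```

The lenient sketch maps all three sequences to U+0000, while the strict decoder accepts only 00.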
Doug Ewell | Thornton, CO, US | ewellic.org