Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Richard Wordingham via Unicode unicode at unicode.org
Wed May 17 17:04:18 CDT 2017


On Wed, 17 May 2017 13:37:51 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:
> 
> >> It is not at all clear what the intent of the encoder was - or even
> >> if it's not just a problem with the data stream. E0 80 80 is not
> >> permitted, it's garbage. An encoder can't "intend" it.  
> >
> > It was once a legal way of encoding NUL, just like C0 80, which is
> > still in use, and seems to be the best way of storing NUL as
> > character content in a *C string*.  
> 
> I wish I had a penny for every time I'd seen this urban legend.
> 
> At http://doc.cat-v.org/bell_labs/utf-8_history you can read the
> original definition of UTF-8, from Ken Thompson on 1992-09-08, so long
> ago that it was still called FSS-UTF:
> 
> "When there are multiple ways to encode a value, for example
> UCS 0, only the shortest encoding is legal."
> 
> Unicode once permitted implementations to *decode* non-shortest forms,
> but never allowed an implementation to *create* them
> (http://www.unicode.org/versions/corrigendum1.html):
> 
> "For example, UTF-8 allows nonshortest code value sequences to be
> interpreted: a UTF-8 conformant process may map the code value sequence C0 80
> (11000000₂ 10000000₂) to the Unicode value U+0000, even though a
> UTF-8 conformant process shall never generate that code value sequence
> -- it shall generate the sequence 00 (00000000₂) instead."
> 
> This was the passage that was deleted as part of Corrigendum #1.
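
To make the distinction in the quoted passage concrete, here is a minimal Python sketch. The helper names decode_strict and decode_two_byte_lenient are made up for illustration; the lenient one only imitates the old "may interpret" allowance for a single two-byte sequence and is not taken from any real library.

```python
def decode_strict(data: bytes):
    """Return the decoded text, or None if the bytes are ill-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_two_byte_lenient(data: bytes) -> str:
    """Hypothetical decoder for one two-byte sequence, with no shortest-form check."""
    lead, trail = data  # unpacking a 2-byte bytes object yields two ints
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 bit pattern")
    # Combine the 5 payload bits of the lead byte with the 6 of the trail byte.
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


# A current, strict decoder rejects both non-shortest forms of NUL.
print(decode_strict(b"\xc0\x80"))
print(decode_strict(b"\xe0\x80\x80"))

# A decoder without the shortest-form check maps C0 80 to U+0000.
print(repr(decode_two_byte_lenient(b"\xc0\x80")))

# Generation was never in question: U+0000 is encoded as the single byte 00.
print("\x00".encode("utf-8"))
```

The lenient helper is deliberately limited to the two-byte pattern; it shows only what "interpreting" C0 80 would mean, not how any particular implementation behaved.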

So it was still a legal way for a non-UTF-8-compliant process!  Note
for example that a compliant implementation of full upper-casing
shall convert the canonically equivalent strings <U+1FB3 GREEK SMALL
LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE> and
<U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI, U+0345 COMBINING GREEK
YPOGEGRAMMENI> to the canonically inequivalent strings <U+0391 GREEK
CAPITAL LETTER ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0313> and
<U+1F08 GREEK CAPITAL LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL
LETTER IOTA>.  A compliant Unicode process may not assume that this is
the right thing to do.  (Or are some compliant Unicode processes
required to incorrectly believe that they are doing something they
mustn't do?)
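
The upper-casing example can be checked with a short Python sketch. It assumes that Python's str.upper() applies the full (SpecialCasing) mappings and that comparing NFD forms is an adequate test of canonical equivalence; the helper name canonically_equivalent is made up for illustration.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their NFD (canonical decomposition) forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# <U+1FB3, U+0313> and <U+1F00, U+0345>
s1 = "\u1FB3\u0313"
s2 = "\u1F00\u0345"

print(canonically_equivalent(s1, s2))

# Full upper-casing of each string, shown as code points.
u1 = s1.upper()
u2 = s2.upper()
print([f"U+{ord(c):04X}" for c in u1])
print([f"U+{ord(c):04X}" for c in u2])

# The upper-cased results are no longer canonically equivalent, because
# U+0399 is a starter and blocks reordering of U+0313 across it.
print(canonically_equivalent(u1, u2))
```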

Richard.


