Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode unicode at unicode.org
Mon May 15 10:37:13 CDT 2017


On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> 
> In reference to:
> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
> 
> I think Unicode should not adopt the proposed change.

Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting multiple errors there makes no sense.

> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> 
> Even though there are notable systems (Win32, Java, C#, JavaScript,
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementations very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake

You may think that.  There are those of us who do not.  UTF-16 makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient for primarily ASCII text, but it has no such advantage for text that is mostly non-ASCII.  And handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.
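
To give a sense of how little work the surrogate part is, here is a rough sketch in Python.  The function name utf16_code_points is made up for this example, and passing lone surrogates through unchanged is just one possible policy, not a recommendation:

```python
def utf16_code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point.  Anything else, including a lone surrogate,
    is passed through unchanged (an assumption made for this sketch).
    """
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 arithmetic: 10 bits from each surrogate.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1
```

For example, list(utf16_code_points([0x0041, 0xD83D, 0xDE00])) gives [0x41, 0x1F600].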

> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the
> Unicode Consortium should be *very* sympathetic to technical
> considerations for implementations that use UTF-8 as the in-memory
> representation of Unicode.

I don’t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don’t see what that has to do with either the original proposal or with your criticism of UTF-16.

[snip]

> If the proposed
> change was adopted, while Draconian decoders (that fail upon first
> error) could retain their current state machine, implementations that
> emit U+FFFD for errors and continue would have to add more state
> machine states (i.e. more complexity) to consolidate more input bytes
> into a single U+FFFD even after a valid sequence is obviously
> impossible.

“Impossible”?  Why?  You just need to add some error states (or *an* error state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the only library that already did just that *because it’s clearly the right thing to do*.
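
As an illustration of the "counter" variant, here is a minimal sketch in Python.  It is not ICU's code and not necessarily the exact behaviour the proposal specifies; the names (trail_count, decode_consolidating, LEAD_INFO) are invented for this example.  The assumption is that the lead byte's bit pattern announces how many trail bytes belong to the sequence, the decoder consumes up to that many, and whatever it consumed becomes a single U+FFFD if the result is over-long, a surrogate, out of range, or truncated:

```python
REPLACEMENT = "\ufffd"

# (payload mask for the lead byte, smallest code point that needs this length),
# indexed by the number of trail bytes the lead byte announces.
LEAD_INFO = {1: (0x1F, 0x80), 2: (0x0F, 0x800), 3: (0x07, 0x10000)}


def trail_count(lead):
    """Trail bytes announced by the lead byte, or None if it cannot lead."""
    if lead < 0x80:
        return 0
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    return None  # stray trail byte (0x80..0xBF) or 0xF8..0xFF


def decode_consolidating(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per ill-formed sequence."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        i += 1
        need = trail_count(lead)
        if need is None:
            out.append(REPLACEMENT)
            continue
        if need == 0:
            out.append(chr(lead))
            continue
        mask, smallest = LEAD_INFO[need]
        cp = lead & mask
        got = 0  # the counter: trail bytes consumed so far
        while got < need and i < len(data) and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        valid = (
            got == need                      # not truncated
            and cp >= smallest               # not over-long
            and cp <= 0x10FFFF               # within the Unicode range
            and not 0xD800 <= cp <= 0xDFFF   # not a surrogate
        )
        # Well-formed or not, the bytes consumed above yield exactly one output.
        out.append(chr(cp) if valid else REPLACEMENT)
    return "".join(out)
```

With this sketch, decode_consolidating(b"A\xC0\xAFB") returns "A\ufffdB": the over-long encoding of "/" is reported as one error.  As far as I know, a decoder that follows the current best-practice recommendation would report each of those two bytes separately.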

Kind regards,

Alastair.

--
http://alastairs-place.net



