Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen via Unicode unicode at
Mon May 15 13:33:18 CDT 2017

On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
<alastair at> wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at> wrote:
>> In reference to:
>> I think Unicode should not adopt the proposed change.
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
> You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case in any way to
go away, I think the Unicode Consortium should recognize the serious
technical merit of the "UTF-8 as the in-memory representation" case as
having significant enough merit that proposals like this should
consider impact to both cases equally despite "UTF-8 as the in-memory
representation" case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

Henri Sivonen
hsivonen at

More information about the Unicode mailing list