Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Asmus Freytag via Unicode unicode at
Mon May 15 15:49:05 CDT 2017

On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote:
>>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>>> representative of implementation concerns of implementations that use
>>> UTF-8 as their in-memory Unicode representation.
>>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>>> representation, which makes concerns of such implementation very
>>> relevant, I think the Unicode Consortium should acknowledge that
>>> UTF-16 was, in retrospect, a mistake
>> You may think that.  There are those of us who do not.
> My point is:
> The proposal seems to arise from the "UTF-16 as the in-memory
> representation" mindset. While I don't expect that case in any way to
> go away, I think the Unicode Consortium should recognize the serious
> technical merit of the "UTF-8 as the in-memory representation" case as
> having significant enough merit that proposals like this should
> consider impact to both cases equally despite "UTF-8 as the in-memory
> representation" case at present appearing to be the minority case.
> That is, I think it's wrong to view things only or even primarily
> through the lens of the "UTF-16 as the in-memory representation" case
> that ICU represents.
UTF-16 has some nice properties and there's no need to brand it a
"mistake". UTF-8 has different nice properties, but there's equally no
reason to treat it as more special than UTF-16.

The UTC should adopt a position of perfect neutrality when it comes to
in-memory representation; in other words, it should not assume that
optimizing for any one encoding form will benefit implementers.

The UTC, where ICU is strongly represented, needs to guard against basing
encoding/properties/algorithm decisions (edge cases mostly) solely or
primarily on the needs of a particular implementation that happens to be
chosen by the ICU project.

