Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Mon May 15 13:33:18 CDT 2017

On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
<alastair at alastairs-place.net> wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting multiple errors there makes no sense.

The currently-specced behavior makes perfect sense when you add error
emission on top of a fail-fast UTF-8 validation state machine.

>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
>
> You may think that.  There are those of us who do not.

My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case in any way to
go away, I think the Unicode Consortium should recognize the serious
technical merit of the "UTF-8 as the in-memory representation" case as
having significant enough merit that proposals like this should
consider impact to both cases equally despite "UTF-8 as the in-memory
representation" case at present appearing to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/