Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Mon May 15 12:52:25 CDT 2017

On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
> Disagree.  An over-long UTF-8 sequence is clearly a single error.  Emitting multiple errors there makes no sense.

Changing a specification as fundamental as this is something that should 
not be undertaken lightly.

Apparently we have a situation where implementations disagree, and have 
done so for a while. This normally means not only that the 
implementations differ, but that data exists in both formats.

Even if it were true that all data is only stored in UTF-8, any data 
converted from UTF-8 back to UTF-8 going through an interim stage that 
requires UTF-8 conversion would then be different based on which 
converter is used.

Implementations working in UTF-8 natively would potentially see three 
formats:
1) the original ill-formed data
2) data converted with single FFFD
3) data converted with multiple FFFD

These forms cannot be compared for equality by binary matching.
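
To make the three forms concrete, here is a minimal sketch in Python. The sample input (an "a", the over-long sequence E0 80 80, then a "b") is a made-up illustration, not data taken from the proposal:

```python
# Hypothetical sample: "a", an over-long three-byte sequence, "b".
ill_formed = b"a\xe0\x80\x80b"

# U+FFFD is encoded in UTF-8 as the bytes EF BF BD.
single_fffd = "a\ufffdb".encode("utf-8")                # form 2
multiple_fffd = "a\ufffd\ufffd\ufffdb".encode("utf-8")  # form 3

# All three byte strings differ, so a binary comparison
# treats them as three unrelated values.
forms = [ill_formed, single_fffd, multiple_fffd]
all_distinct = len(set(forms)) == 3
```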

The best that can be done is to convert (1) into one of the other forms 
and then compare treating any run of FFFD code points as equal to any 
other run, irrespective of length.
(For security-critical applications, the presence of any FFFD should 
render the data invalid, so the comparisons we'd be talking about here 
would be for general-purpose uses such as search.)
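
A rough sketch of such a comparison, in Python. The helper names `normalize` and `loosely_equal` are invented for illustration, and the approach assumes only that the decoder emits at least one FFFD per ill-formed sequence, whatever its exact policy:

```python
import re

# One or more consecutive U+FFFD code points.
_FFFD_RUN = re.compile("\ufffd+")

def normalize(data: bytes) -> str:
    """Decode leniently, then collapse every run of U+FFFD to one."""
    text = data.decode("utf-8", errors="replace")
    return _FFFD_RUN.sub("\ufffd", text)

def loosely_equal(a: bytes, b: bytes) -> bool:
    """Compare two byte strings, ignoring the length of FFFD runs."""
    return normalize(a) == normalize(b)
```

Note that this also treats two adjacent, genuinely separate errors as one, which is the price of ignoring run length.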

Because we've had years of multiple implementations, it would be 
expected that copious data exists in all three formats, and that data 
will not go away. Changing the specification to pick one of these 
formats as solely conformant is IMHO too late.

A./

>
>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>>
>> Even though there are notable systems (Win32, Java, C#, JavaScript,
>> ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementation very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake
> You may think that.  There are those of us who do not.  The fact is that UTF-16 makes sense as a default encoding in many cases.  Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the case for other situations and the fact is that handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.
>
>> Therefore, despite UTF-16 being widely used as an in-memory
>> representation of Unicode and in no way going away, I think the
>> Unicode Consortium should be *very* sympathetic to technical
>> considerations for implementations that use UTF-8 as the in-memory
>> representation of Unicode.
> I don’t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don’t see what that has to do with either the original proposal or with your criticism of UTF-16.
>
> [snip]
>
>> If the proposed
>> change was adopted, while Draconian decoders (that fail upon first
>> error) could retain their current state machine, implementations that
>> emit U+FFFD for errors and continue would have to add more state
>> machine states (i.e. more complexity) to consolidate more input bytes
>> into a single U+FFFD even after a valid sequence is obviously
>> impossible.
> “Impossible”?  Why?  You just need to add some error states (or *an* error state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the only library that already did just that *because it’s clearly the right thing to do*.
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net