Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode unicode at unicode.org
Tue May 16 02:42:46 CDT 2017


On 16 May 2017, at 08:22, Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.

(The only case I can think of where the in-memory representation has a significant effect is the default binary ordering of string data: in the presence of non-BMP characters, UTF-8 and UCS-4 sort the same way, but UTF-16 doesn’t, because the surrogates are “in the wrong place”.  I think everyone is well aware of that, no?)
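
If you want to see that concretely, here’s a quick illustration (Python 3, purely because it happens to have all three codecs to hand):

  # U+E000 is a BMP character; U+10000 is the first supplementary character.
  bmp, supp = "\ue000", "\U00010000"

  # UTF-8 and UCS-4/UTF-32 agree: the BMP character sorts first.
  assert bmp.encode("utf-8") < supp.encode("utf-8")          # EE 80 80 < F0 90 80 80
  assert bmp.encode("utf-32-be") < supp.encode("utf-32-be")  # 00 00 E0 00 < 00 01 00 00

  # UTF-16 disagrees: the surrogate pair D8 00 DC 00 sorts *before* E0 00.
  assert bmp.encode("utf-16-be") > supp.encode("utf-16-be")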

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>> test with three major browsers that use UTF-16 internally and have
>> independent (of each other) implementations of UTF-8 decoding
>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>> Unicode standard away from that kind of interop needs *way* better
>> rationale than "feels right”.

In what sense is this “interop”?  Under what circumstance would it matter how many U+FFFDs you see?  If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.
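
To make that concrete, this is roughly the check I have in mind (a Python sketch; the function name is mine, not anything standardised):

  REPLACEMENT = "\ufffd"

  def cautious_equal(a: str, b: str) -> bool:
      # If either string contains U+FFFD, some of the original data was
      # lost in decoding, so we cannot know what it "really" said.
      if REPLACEMENT in a or REPLACEMENT in b:
          return False
      return a == b

On that basis, how many U+FFFDs a decoder happens to emit for a given piece of garbage makes no difference to the comparison.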

Would you advocate replacing

  e0 80 80

with

  U+FFFD U+FFFD U+FFFD     (1)

rather than

  U+FFFD                   (2) ?

It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input.
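
For concreteness, here’s a rough sketch of a decoder behaving as in (2) (Python, purely illustrative, not production code, and the function name is mine): it groups a lead byte with whatever continuation bytes follow it, and emits a single U+FFFD if that whole group turns out to be ill-formed:

  def decode_utf8_one_fffd(data: bytes) -> str:
      out = []
      i = 0
      while i < len(data):
          lead = data[i]
          if lead < 0x80:                          # ASCII byte
              out.append(chr(lead))
              i += 1
              continue
          if 0xC0 <= lead <= 0xDF:
              need = 1                             # lead byte of a 2-byte sequence
          elif 0xE0 <= lead <= 0xEF:
              need = 2                             # lead byte of a 3-byte sequence
          elif 0xF0 <= lead <= 0xF7:
              need = 3                             # lead byte of a 4-byte sequence
          else:
              out.append("\ufffd")                 # stray continuation or invalid byte
              i += 1
              continue
          # Group the lead byte with up to `need` continuation bytes.
          j = i + 1
          while j < len(data) and j - i <= need and 0x80 <= data[j] <= 0xBF:
              j += 1
          try:
              out.append(data[i:j].decode("utf-8"))  # well-formed: decode as usual
          except UnicodeDecodeError:
              out.append("\ufffd")                   # ill-formed: one U+FFFD for the group
          i = j
      return "".join(out)

  decode_utf8_one_fffd(b"\xe0\x80\x80")   # -> '\ufffd', i.e. behaviour (2)

Informally, the difference between (1) and (2) is whether validity is checked byte by byte as the sequence is consumed, or only once the apparent sequence has been grouped; the sketch above does the grouping first.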

This isn’t just a matter of “feels nicer”.  (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net



