Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 05:40:37 CDT 2017

On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton
<alastair at alastairs-place.net> wrote:
> On 16 May 2017, at 09:31, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
>> <alastair at alastairs-place.net> wrote:
>>> That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.
>>
>> If the internal representation is UTF-16 (or UTF-32), it is a likely
>> design that there is a variable into which the scalar value of the
>> current code point is accumulated during UTF-8 decoding.
>
> That’s quite a likely design with a UTF-8 internal representation too; it’s just that you’d only decode during processing, as opposed to immediately at input.

The time to generate the U+FFFDs is at input time, which is what's at
issue here. Later processing, which may involve iterating by code
point and computing scalar values, is a different step that should be
able to assume valid UTF-8 and not be concerned with invalid UTF-8.
(Programming languages and frameworks vary in how confidently they
let you maintain the invariant that, after input, all in-RAM UTF-8
can be treated as valid.)
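
To illustrate the separation of the two steps, here is a minimal
Python 3 sketch. The function names read_text and count_code_points
are hypothetical, and Python's str is not UTF-8 in memory, so this
only shows where the U+FFFD generation sits, not a UTF-8 internal
representation.

```python
def read_text(raw: bytes) -> str:
    # Input boundary: ill-formed sequences are replaced with U+FFFD here, once.
    return raw.decode("utf-8", errors="replace")


def count_code_points(text: str) -> int:
    # Later processing: iterates by code point and assumes valid text,
    # so it carries no error handling of its own.
    return sum(1 for _ in text)


text = read_text(b"ab\xffcd")
print(ascii(text))              # 'ab\ufffdcd'
print(count_code_points(text))  # 5
```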

>> When the internal representation is UTF-8, only UTF-8 validation is
>> needed, and it's natural to have a fail-fast validator, which *doesn't
>> necessarily need such a scalar value accumulator at all*.
>
> Sure.  But a state machine can still contain appropriate error states without needing an accumulator.

As I said upthread, it could, but it seems inappropriate to ask
implementations to take on that extra complexity on as weak grounds as
"ICU does it" or "feels right" when the current recommendation doesn't
call for those extra states and the current spec is consistent with a
number of prominent non-ICU implementations, including Web browsers.
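
As a rough illustration of a fail-fast validator without a scalar
value accumulator, here is a sketch in Python 3. The function name
utf8_valid_up_to is hypothetical, and the byte ranges are taken from
the well-formed UTF-8 byte sequence table in the Unicode Standard
(Table 3-7). It is not the code of any implementation mentioned in
this thread.

```python
def utf8_valid_up_to(data: bytes) -> int:
    """Return the length of the longest well-formed UTF-8 prefix of data."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            i += 1
            continue
        # For each lead byte: number of trailing bytes and the allowed
        # range of the second byte. No code point value is ever computed.
        if 0xC2 <= lead <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong forms
        elif lead == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong forms
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
        else:
            return i  # 0x80..0xC1 and 0xF5..0xFF never start a sequence
        if i + need >= n:
            return i  # truncated sequence at end of input
        if not lo <= data[i + 1] <= hi:
            return i
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return i
        i += need + 1
    return n
```

The validator only reports where the first error is. It keeps a
position and a few byte-range checks, and nothing in it depends on
how many U+FFFDs a caller later decides to emit.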

>>> In what sense is this “interop”?
>>
>> In the sense that prominent independent implementations do the same
>> externally observable thing.
>
> The argument is, I think, that in this case the thing they are doing is the *wrong* thing.

It seems weird to characterize following the currently-specced "best
practice" as "wrong" without showing a compelling fundamental flaw
(such as a genuine security problem) in the currently-specced "best
practice". With implementations of the currently-specced "best
practice" already shipped, I don't think aesthetic preferences should
be considered enough of a reason to proclaim behavior adhering to the
currently-specced "best practice" as "wrong".

>  That many of them do it would only be an argument if there was some reason that it was desirable that they did it.  There doesn’t appear to be such a reason, unless you can think of something that hasn’t been mentioned thus far?

I've already given a reason: UTF-8 validation code not needing to have
extra states catering to aesthetic considerations of U+FFFD
consolidation.

>  The only reason you’ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don’t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers).

Not just browsers. Also OpenJDK and Python 3. Do I really need to test
the standard libraries of more languages/systems to more strongly make
the case that the ICU behavior (according to the proposal PDF) is not
the norm and what the spec currently says is?

> I’ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not UTF-8 anyway).  Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no?

I see that behavior in Python 2. Earlier, I said that Python 3 agrees
with the current spec for my test case. The Python 2 behavior I see is
not just against "best practice" but obviously non-compliant.

(For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.)
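
For anyone who wants to repeat the check on the byte sequence quoted
above, a snippet along these lines should do on Python 3. This is a
sketch and not the exact test I ran earlier in the thread. As far as
I can tell from the table of well-formed byte sequences, ED cannot be
followed by A0..BF, so none of the six bytes combines with a
neighbor and each is replaced on its own.

```python
data = bytes.fromhex("eda0bdedb880")

# Lenient decoding: each ill-formed byte becomes its own U+FFFD.
decoded = data.decode("utf-8", errors="replace")
print(len(decoded), ascii(decoded))
# 6 '\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd'

# Strict decoding rejects the input instead of producing U+1F600.
try:
    data.decode("utf-8")
    rejected = False
except UnicodeDecodeError as err:
    rejected = True
    print("rejected at byte offset", err.start)
```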

> One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement.

This is an odd argument in favor of changing it. If the argument is
that it's just a recommendation that you don't need to adhere to,
surely then the people who don't like the current recommendation
should choose not to adhere to it instead of advocating changing it.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/