Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Tue May 16 05:09:44 CDT 2017
On 16 May 2017, at 09:31, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
> <alastair at alastairs-place.net> wrote:
>> That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.
> If the internal representation is UTF-16 (or UTF-32), it is a likely
> design that there is a variable into which the scalar value of the
> current code point is accumulated during UTF-8 decoding.
That’s quite a likely design with a UTF-8 internal representation too; it’s just that you’d only decode during processing, as opposed to immediately at input.
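To make the accumulator design concrete, here is a minimal sketch (names and structure are mine, not taken from any particular library): a decoder feeding a UTF-16/UTF-32 pipeline naturally accumulates each scalar value in a single integer as the continuation bytes arrive:

```python
def decode_utf8(data: bytes) -> list[int]:
    """Illustrative UTF-8 decoder that builds each scalar value in a
    single integer accumulator.  Raises ValueError on ill-formed input
    rather than emitting U+FFFD, to keep the sketch short."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            scalar, need, min_scalar = b, 0, 0
        elif 0xC2 <= b <= 0xDF:
            scalar, need, min_scalar = b & 0x1F, 1, 0x80
        elif 0xE0 <= b <= 0xEF:
            scalar, need, min_scalar = b & 0x0F, 2, 0x800
        elif 0xF0 <= b <= 0xF4:
            scalar, need, min_scalar = b & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte at {i}")
        i += 1
        for _ in range(need):
            if i >= len(data) or not 0x80 <= data[i] <= 0xBF:
                raise ValueError(f"bad continuation byte at {i}")
            scalar = (scalar << 6) | (data[i] & 0x3F)  # accumulate
            i += 1
        # Reject overlongs, surrogates and out-of-range values.
        if scalar < min_scalar or 0xD800 <= scalar <= 0xDFFF:
            raise ValueError(f"ill-formed scalar {scalar:#x}")
        out.append(scalar)
    return out
```

The same loop works whether you decode at input time or lazily during processing; only the call site moves.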
> When the internal representation is UTF-8, only UTF-8 validation is
> needed, and it's natural to have a fail-fast validator, which *doesn't
> necessarily need such a scalar value accumulator at all*.
Sure. But a state machine can still contain appropriate error states without needing an accumulator. That the ones you care about currently don’t is readily apparent, but there’s nothing stopping them from doing so.
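Here is one shape such a validator might take (a hand-rolled sketch, not drawn from any of the implementations under discussion): validity, including the surrogate and overlong exclusions, is tracked entirely by state, with no scalar-value accumulator anywhere. The byte ranges follow the standard’s well-formed UTF-8 table (Table 3-7); the state names are mine.

```python
ACCEPT = "accept"
LEAD = [  # (lo, hi, next_state) for a byte seen in the ACCEPT state
    (0x00, 0x7F, ACCEPT),
    (0xC2, 0xDF, "tail1"),
    (0xE0, 0xE0, "e0"), (0xE1, 0xEC, "tail2"),
    (0xED, 0xED, "ed"), (0xEE, 0xEF, "tail2"),
    (0xF0, 0xF0, "f0"), (0xF1, 0xF3, "tail3"),
    (0xF4, 0xF4, "f4"),
]
CONT = {  # state -> (lo, hi, next_state) for the next continuation byte
    "tail1": (0x80, 0xBF, ACCEPT),
    "tail2": (0x80, 0xBF, "tail1"),
    "tail3": (0x80, 0xBF, "tail2"),
    "e0": (0xA0, 0xBF, "tail1"),  # excludes overlong 3-byte forms
    "ed": (0x80, 0x9F, "tail1"),  # excludes surrogates U+D800..U+DFFF
    "f0": (0x90, 0xBF, "tail2"),  # excludes overlong 4-byte forms
    "f4": (0x80, 0x8F, "tail2"),  # excludes values above U+10FFFF
}

def validate_utf8(data: bytes) -> bool:
    state = ACCEPT
    for b in data:
        if state == ACCEPT:
            for lo, hi, nxt in LEAD:
                if lo <= b <= hi:
                    state = nxt
                    break
            else:
                return False  # invalid lead byte: fail fast
        else:
            lo, hi, nxt = CONT[state]
            if not lo <= b <= hi:
                return False  # invalid continuation: fail fast
            state = nxt
    return state == ACCEPT
```

Adding error *states* (rather than bailing out) to drive U+FFFD emission is a mechanical extension of the same table.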
I don’t see this as an argument about implementations, since it really makes very little difference to the implementation which approach is taken; in both internal representations, the question is whether you generate U+FFFD immediately on detection of the first incorrect *byte*, or whether you do so after reading a complete sequence. UTF-8 sequences are bounded anyway, so it isn’t as if failing early gives you any significant performance benefit.
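The difference between the two policies is visible in how many replacement characters come out of a short ill-formed input. A sketch using CPython 3’s built-in decoder, which follows the current maximal-subpart recommendation:

```python
# CPython 3's 'replace' handler follows the current "maximal subpart"
# recommendation: an overlong four-byte sequence whose continuation bytes
# are out of range becomes one U+FFFD per byte, whereas the proposed
# ICU-style change (one U+FFFD per sequence as declared by the lead byte)
# would emit just one.
bad = b"\xf0\x80\x80\x80"  # overlong encoding of U+0000
print(bad.decode("utf-8", "replace"))        # four U+FFFDs

truncated = b"\xe0\xa0"    # valid prefix of a sequence, cut short
print(truncated.decode("utf-8", "replace"))  # a single U+FFFD
```

Either way, no sequence is longer than four bytes, which is the point about boundedness above.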
>> In what sense is this “interop”?
> In the sense that prominent independent implementations do the same
> externally observable thing.
The argument is, I think, that in this case the thing they are doing is the *wrong* thing. That many of them do it would only be an argument if there were some reason it was desirable that they did it, and there doesn’t appear to be such a reason, unless you can think of something that hasn’t been mentioned thus far? The only reason you’ve given, to date, is that they currently do that, so that should be the recommended behaviour. That is little different from the argument (which nobody actually deployed) that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don’t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers.
I’ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not UTF-8). Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *six* U+FFFDs, one per maximal subpart (I think), no?
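To illustrate both readings of those six bytes (hedged: Python 3’s codecs are stricter than the tolerant behaviour described above, so the `surrogatepass` error handler is used here to emulate a surrogate-tolerant, CESU-8-style decoder):

```python
cesu = b"\xed\xa0\xbd\xed\xb8\x80"  # CESU-8: surrogate pair for U+1F600

# A surrogate-tolerant decoder sees the pair D83D/DE00; recombining it
# via UTF-16 yields U+1F600.
pair = cesu.decode("utf-8", "surrogatepass")
emoji = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(emoji)  # 😀

# A decoder following the current recommendation replaces each maximal
# subpart; here every byte is its own subpart, so six U+FFFDs result.
print(cesu.decode("utf-8", "replace"))
```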
One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement.