Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen via Unicode unicode at unicode.org
Tue May 16 03:31:07 CDT 2017


On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag <asmusf at ix.netcom.com> wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton
<alastair at alastairs-place.net> wrote:
> That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), a likely design
is to have a variable into which the scalar value of the current code
point is accumulated during UTF-8 decoding. In such a design, it can
be argued that it's "natural" to first operate according to the
general structure of UTF-8 and only then inspect what ended up in the
accumulation variable (ruling out non-shortest forms, values above the
Unicode range and surrogate values after the fact).
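
To make that concrete, here is a minimal sketch of my own (not code
from any shipping decoder) of the accumulator-style approach: follow
the general UTF-8 structure first, check validity afterwards. With
this shape, treating e.g. <e0 80 80> as one failed sequence feels like
the path of least resistance.

  // Illustrative sketch only. Decodes one code point by accumulating
  // the scalar value first and validating it afterwards. Assumes
  // `bytes` is non-empty. Returns Ok((code point, bytes consumed)) or
  // Err(number of bytes that structurally "looked like" one sequence).
  fn decode_one(bytes: &[u8]) -> Result<(char, usize), usize> {
      let b0 = bytes[0];
      let (len, mut scalar) = match b0 {
          0x00..=0x7F => return Ok((b0 as char, 1)),
          0xC0..=0xDF => (2, (b0 & 0x1F) as u32),
          0xE0..=0xEF => (3, (b0 & 0x0F) as u32),
          0xF0..=0xF7 => (4, (b0 & 0x07) as u32),
          _ => return Err(1), // lone trail byte or invalid lead byte
      };
      for i in 1..len {
          match bytes.get(i) {
              Some(&b) if b & 0xC0 == 0x80 => {
                  scalar = (scalar << 6) | (b & 0x3F) as u32;
              }
              _ => return Err(i), // truncated: only i bytes accumulated
          }
      }
      // Only now rule out non-shortest forms, surrogates and values
      // above the Unicode range.
      let min = [0u32, 0, 0x80, 0x800, 0x10000][len];
      if scalar < min || scalar > 0x10_FFFF || (0xD800..=0xDFFF).contains(&scalar) {
          return Err(len); // the whole sequence is rejected as one unit
      }
      Ok((char::from_u32(scalar).unwrap(), len))
  }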

When the internal representation is UTF-8, only UTF-8 validation is
needed, and it's natural to have a fail-fast validator, which *doesn't
necessarily need such a scalar value accumulator at all*. The
construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/, when
used as a UTF-8 validator, is a good illustration that a UTF-8
validator need not look anything like a "natural" UTF-8-to-UTF-16
converter.
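
By contrast, here is a sketch (again my own illustration, in the
spirit of the table/DFA approach linked above but written out as
explicit match arms) of a fail-fast validator that never accumulates a
scalar value. The lead byte merely constrains the permissible range of
the first trail byte, so e.g. <e0 80> already fails on the 0x80; there
is no accumulated "one code point" anywhere.

  // Illustrative fail-fast validator sketch: no scalar value is ever
  // accumulated. Each lead byte fixes the allowed range of its first
  // trail byte; the ranges encode the shortest-form, surrogate and
  // range restrictions directly. Returns Err(index of the first byte
  // that can't continue a well-formed sequence).
  fn validate(input: &[u8]) -> Result<(), usize> {
      let mut i = 0;
      while i < input.len() {
          let (trail_count, first_lo, first_hi) = match input[i] {
              0x00..=0x7F => { i += 1; continue; }
              0xC2..=0xDF => (1, 0x80, 0xBF),
              0xE0 => (2, 0xA0, 0xBF),                 // excludes overlongs
              0xE1..=0xEC | 0xEE..=0xEF => (2, 0x80, 0xBF),
              0xED => (2, 0x80, 0x9F),                 // excludes surrogates
              0xF0 => (3, 0x90, 0xBF),                 // excludes overlongs
              0xF1..=0xF3 => (3, 0x80, 0xBF),
              0xF4 => (3, 0x80, 0x8F),                 // excludes > U+10FFFF
              _ => return Err(i),                      // C0, C1, F5..FF, lone trail
          };
          for k in 0..trail_count {
              let (lo, hi) = if k == 0 { (first_lo, first_hi) } else { (0x80, 0xBF) };
              match input.get(i + 1 + k) {
                  Some(&b) if b >= lo && b <= hi => {}
                  _ => return Err(i + 1 + k),          // fail fast, right here
              }
          }
          i += 1 + trail_count;
      }
      Ok(())
  }

A decoder built around this kind of validator naturally emits one
U+FFFD each time validation fails and then resumes at the offending
byte, which is exactly how you end up with the number of U+FFFDs the
current best practice calls for.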

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
>>> test with three major browsers that use UTF-16 internally and have
>>> independent (of each other) implementations of UTF-8 decoding
>>> (Firefox, Edge and Chrome) shows agreement on the current spec: there
>>> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
>>> 6 on the second, 4 on the third and 6 on the last line). Changing the
>>> Unicode standard away from that kind of interop needs *way* better
>>> rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same
externally observable thing.

> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in the face of browsers, OpenJDK, Python 3 (and, likely, "etc.")
agreeing with the current spec should show why it's important to have
a different number of U+FFFDs than the spec's "best practice" calls
for now.

>  If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.

In practice, the Web Platform, for example, doesn't allow processing
to stop just because the input contains a U+FFFD, so the focus is
mainly on making sure that U+FFFDs are placed well enough to prevent
bad stuff under normal operations. At least typically, the number of
U+FFFDs doesn't matter for that purpose, but when browsers agree on
the number of U+FFFDs, changing that number should have an
overwhelmingly strong rationale. A security reason could be a strong
one, but such a security motivation for fewer U+FFFDs has not been
shown, to my knowledge.

> Would you advocate replacing
>
>   e0 80 80
>
> with
>
>   U+FFFD U+FFFD U+FFFD     (1)
>
> rather than
>
>   U+FFFD                   (2)

I advocate (1), most simply because that's what Firefox, Edge and
Chrome do *in accordance with the currently-recommended best practice*
and, less simply, because it makes sense in the presence of a
fail-fast UTF-8 validator. I think the burden of proof to show an
overwhelmingly good reason to change should, at this point, be on
whoever proposes doing it differently than what the current
widely-implemented spec says.
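
For what it's worth, here is one more data point I didn't list above,
checked by me rather than taken from the earlier discussion: as far as
I know, Rust's standard library also implements the
currently-recommended substitution of maximal subparts, so for this
exact byte sequence it produces behaviour (1).

  fn main() {
      // The overlong sequence from the quoted example: <e0 80 80>.
      let bytes = [0xE0u8, 0x80, 0x80];
      let decoded = String::from_utf8_lossy(&bytes);
      // Behaviour (1): three replacement characters, matching what
      // Firefox, Edge and Chrome do on the test page linked earlier.
      assert_eq!(decoded, "\u{FFFD}\u{FFFD}\u{FFFD}");
  }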

> It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input.

As noted previously, the logic is that you generate a U+FFFD whenever
a fail-fast validator fails.

> This isn’t just a matter of “feels nicer”.  (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation.

Again, the current best practice makes perfect logical sense in the
context of a fail-fast UTF-8 validator. Moreover, it doesn't look like
the two behaviours are "out there" in equal measure when major
browsers, OpenJDK and Python 3 agree. (I expect I could find more
prominent implementations that implement the currently-stated best
practice, but I feel I shouldn't have to.) From my experience working
on Web standards and implementing them, I think it's a bad idea to
change something to be "more logical" when the change would move away
from browser consensus.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/
