Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Thu Jun 1 04:32:08 CDT 2017

On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
<unicode at unicode.org> wrote:
> On Wed, 31 May 2017 15:12:12 +0300
> Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>> I am not claiming it's too difficult to implement. I think it
>> inappropriate to ask implementations, even from-scratch ones, to take
>> on added complexity in error handling on mere aesthetic grounds. Also,
>> I think it's inappropriate to induce implementations already written
>> according to the previous guidance to change (and risk bugs) or to
>> make the developers who followed the previous guidance with precision
>> be the ones who need to explain why they aren't following the new
>> guidance.
>
> How straightforward is the FSM for back-stepping?

This seems beside the point, since the new guidance wasn't advertised
as improving backward stepping compared to the old guidance.

(At first glance, I don't see the new guidance improving back-stepping.
In fact, if the UTC meant to adopt ICU's behavior for obsolete five-
and six-byte bit patterns, then AFAICT back-stepping with the ICU
behavior requires examining more bytes backward than the old guidance
required.)

>> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
>> <unicode at unicode.org> wrote:
>> > The UTF-8 conversion code that I wrote for ICU, and apparently the
>> > code that various other people have written, collects sequences
>> > starting from lead bytes, according to the original spec, and at
>> > the end looks at whether the assembled code point is too low for
>> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a
>> > non-trail byte is quite natural, and reading the PRI text
>> > accordingly is quite natural too.
>>
>> I don't doubt that other people have written code with the same
>> concept as ICU, but as far as non-shortest form handling goes in the
>> implementations I tested (see URL at the start of this email) ICU is
>> the lone outlier.
>
> You should have researched implementations as they were in 2007.

I don't see how the state of things in 2007 is relevant to a decision
taken in 2017. It's relevant that by 2017, prominent implementations
had adopted the old Unicode guidance, and, that being the case, it's
inappropriate to change the guidance for aesthetic reasons or to favor
the Unicode Consortium-hosted implementation.
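
For illustration only, here is a minimal sketch of how one could probe
a single decoder's U+FFFD generation for ill-formed input. It uses
Python's built-in UTF-8 decoder with errors="replace" purely as an
example target; the helper name replacement_count and the sample
byte strings are assumptions made for this sketch and are not taken
from the test suite referred to in this thread. Under the old guidance
(one U+FFFD per maximal subpart of an ill-formed subsequence), the
three-byte non-shortest form E0 80 80 would be expected to yield three
U+FFFD, whereas a decoder that collects the whole sequence from the
lead byte before validating it would yield one.

```python
REPLACEMENT = "\ufffd"


def replacement_count(data: bytes) -> int:
    """Count the U+FFFD characters the built-in decoder emits for data.

    Hypothetical helper for this sketch: it only observes what the
    decoder does and makes no claim about why.
    """
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Sample ill-formed inputs (chosen for this sketch).
samples = {
    "non-shortest form, 3 bytes": b"\xe0\x80\x80",
    "non-shortest form, 2 bytes": b"\xc0\x80",
    "truncated 4-byte sequence": b"\xf0\x90\x80",
}

for label, data in samples.items():
    print(f"{label}: {data.hex()} -> {replacement_count(data)} U+FFFD")
```

Running the same inputs through several decoders and comparing the
counts is the kind of check that shows whether implementations agree
with the existing spec language.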

On Wed, May 31, 2017 at 8:43 PM, Shawn Steele via Unicode
<unicode at unicode.org> wrote:
> I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen.

I'm a browser developer. I've explained previously on this list and in
my blog post why the browser developer / Web standard culture favors
well-defined behavior in error cases these days.

On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode
<unicode at unicode.org> wrote:
> Henri Sivonen wrote:
>
>> If anything, I hope this thread results in the establishment of a
>> requirement for proposals to come with proper research about what
>> multiple prominent implementations do about the subject matter of a
>> proposal concerning changes to text about implementation behavior.
>
> Considering that several folks have objected that the U+FFFD
> recommendation is perceived as having the weight of a requirement, I
> think adding Henri's good advice above as a "requirement" seems
> heavy-handed. Who will judge how much research qualifies as "proper"?

In the Unicode scope, it's indeed harder than in the WHATWG scope to
draw a clear line around what the prominent implementations are. The
point is that just checking ICU is not good enough. Someone making a
proposal should check the four major browser engines and a bunch of
system frameworks and standard libraries for well-known programming
languages. Which frameworks and standard libraries to check, and how
many, can't be defined objectively and depends on the subject matter
(there are many UTF-8 decoders but, e.g., fewer text shaping engines).
There will be diminishing returns to checking them. Chances are that
a pattern emerges after checking only a few, and that pattern shows
whether the existing spec language is being implemented (don't change
it) or being ignored (probably should be changed then).

In any case, "we can't check everything or choose fairly what exactly
to check" shouldn't make it OK to check only ICU or to make abstract
arguments without checking any implementations at all. Checking
multiple popular implementations is better homework than checking only
ICU, even if it's up to the person making the proposal to choose
exactly which implementations to check. The committee should be able
to recognize whether the list of implementations tested looks like a
list of broadly-deployed implementations.

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
<unicode at unicode.org> wrote:
> * As far as I can tell, there are two (maybe three) sane approaches to this problem:
>         * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence
>         * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid.  In that case just use one U+FFFD.
>         * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group).
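
For readers who want to see the three approaches in the quoted list
side by side, here is a rough sketch. It is not the behavior of ICU or
of any other implementation named in this thread. The policy names
per_byte, per_claimed_sequence and per_run, and all function names, are
made up for this sketch. It also assumes that lead bytes claim a length
by the two- to four-byte bit patterns only (so C0, C1 and F5..F7 count
as lead bytes, while F8..FF are treated as single bad bytes), and that
the "minimal" approach stops early at the first non-trail byte.

```python
REPLACEMENT = "\ufffd"


def well_formed_len(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at i, else 0."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:
            break  # ran out of input
        try:
            chunk.decode("utf-8")  # strict: rejects non-shortest forms etc.
            return n
        except UnicodeDecodeError:
            continue
    return 0


def claimed_span(data: bytes, i: int) -> int:
    """Bytes covered if the lead byte at i is trusted to count its trails.

    Assumption of this sketch: stop early at the first non-trail byte.
    """
    lead = data[i]
    if 0xC0 <= lead <= 0xDF:
        want = 2
    elif 0xE0 <= lead <= 0xEF:
        want = 3
    elif 0xF0 <= lead <= 0xF7:
        want = 4
    else:
        want = 1  # stray trail byte or F8..FF
    j = i + 1
    while j < i + want and j < len(data) and 0x80 <= data[j] <= 0xBF:
        j += 1
    return j - i


def decode_with_policy(data: bytes, policy: str) -> str:
    """Decode data, replacing ill-formed bytes according to policy."""
    if policy not in ("per_byte", "per_claimed_sequence", "per_run"):
        raise ValueError(f"unknown policy: {policy}")
    out = []
    i = 0
    in_garbage = False
    while i < len(data):
        n = well_formed_len(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            in_garbage = False
        elif policy == "per_byte":
            out.append(REPLACEMENT)  # one U+FFFD for every bad byte
            i += 1
        elif policy == "per_claimed_sequence":
            out.append(REPLACEMENT)  # one U+FFFD per lead byte plus trails
            i += claimed_span(data, i)
        else:  # per_run
            if not in_garbage:
                out.append(REPLACEMENT)  # one U+FFFD per run of bad bytes
            in_garbage = True
            i += 1
    return "".join(out)


sample = b"A\xe0\x80\x80B\xff\xfeC"
for policy in ("per_byte", "per_claimed_sequence", "per_run"):
    print(policy, ascii(decode_with_policy(sample, policy)))
```

For the sample input above (a non-shortest three-byte form followed by
two bytes that can never appear in UTF-8), the three policies emit
five, three and two U+FFFD respectively.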

I think it's not useful to come up with new rules in the abstract. I'd
like to focus on the fact that the Standard expressed a preference and
the preference got implemented (in broadly-deployed well-known
software). That being the case, it's not OK to subsequently change the
preference expressed in the Standard on the basis of what "feels
right" or "sane" when there was no super-serious problem with the
previously-expressed preference, which had already been implemented in
multiple pieces of broadly-deployed software whose developers took the
Standard's expression of preference seriously.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/