Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen via Unicode unicode at unicode.org
Wed May 31 07:12:12 CDT 2017


I've researched this more. While the old advice dominates the handling
of non-shortest forms, there is more variation than I previously
thought when it comes to truncated sequences and CESU-8-style
surrogates. Still, ICU's behavior is an outlier within the set of
implementations that I tested.

I've written up my findings at https://hsivonen.fi/broken-utf-8/

The write-up mentions
https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
like to draw everyone's attention to that bug, which is real-world
evidence of a bug arising from two UTF-8 decoders within one product
handling UTF-8 errors differently.

On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode
<unicode at unicode.org> wrote:
> There is plenty of time for public comment, since it was targeted at Unicode
> 11, the release for about a year from now, not Unicode 10, due this year.
> When the UTC "approves a change", that change is subject to comment, and the
> UTC can always reverse or modify its approval up until the meeting before
> release date. So there are ca. 9 months in which to comment.

What should I read to learn how to formulate an appeal correctly?

Does it matter whether a proposal/appeal is submitted by a non-member
implementor, by an individual member, or by a liaison member?
http://www.unicode.org/consortium/liaison-members.html lists
"the Mozilla Project" as a liaison member, but Mozilla-side
conventions make submitting proposals like this "as Mozilla"
problematic (we tend to avoid "as Mozilla" statements in technical
standardization fora except when the W3C Process forces us to make
them as part of charter or Proposed Recommendation review).

> The modified text is a set of guidelines, not requirements. So no
> conformance clause is being changed.

I'm aware of this.

> If people really believed that the guidelines in that section should have
> been conformance clauses, they should have proposed that at some point.

It seems to me that this thread does not support the conclusion that
the Unicode Standard's expressed preference for the number of
REPLACEMENT CHARACTERs should be turned into a conformance requirement.
If anything, this thread could be taken to support the conclusion that
the standard should not express any preference beyond "at least one
and at most as many as there were bytes".

On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
<unicode at unicode.org> wrote:
>  In any case, Henri is complaining that it’s too difficult to implement; it isn’t.  You need two extra states, both of which are trivial.

I am not claiming it's too difficult to implement. I think it is
inappropriate to ask implementations, even from-scratch ones, to take
on added complexity in error handling on mere aesthetic grounds. I
also think it is inappropriate to induce implementations that were
already written according to the previous guidance to change (and risk
bugs), or to put the developers who followed the previous guidance
precisely in the position of having to explain why they aren't
following the new guidance.
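
For concreteness, here is a minimal lossy-decoding sketch in Rust (the
function name is mine, for illustration only; it is not from any
shipping code). The standard library's std::str::from_utf8 reports
errors in terms that match the longstanding guidance: error_len()
covers one maximal subpart of an ill-formed subsequence, and None
signals a truncated sequence at the end of the input.

    fn decode_lossy(mut input: &[u8]) -> String {
        let mut out = String::new();
        loop {
            match std::str::from_utf8(input) {
                // The remainder is valid UTF-8; we are done.
                Ok(tail) => {
                    out.push_str(tail);
                    return out;
                }
                Err(e) => {
                    // Everything up to valid_up_to() is known-good UTF-8.
                    let valid = e.valid_up_to();
                    out.push_str(std::str::from_utf8(&input[..valid]).unwrap());
                    // One REPLACEMENT CHARACTER per maximal subpart.
                    out.push('\u{FFFD}');
                    match e.error_len() {
                        // Skip the maximal subpart and continue decoding.
                        Some(len) => input = &input[valid + len..],
                        // The input ended with a truncated sequence.
                        None => return out,
                    }
                }
            }
        }
    }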

On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
<unicode at unicode.org> wrote:
> The UTF-8 conversion code that I wrote for ICU, and apparently the code that
> various other people have written, collects sequences starting from lead
> bytes, according to the original spec, and at the end looks at whether the
> assembled code point is too low for the lead byte, or is a surrogate, or is
> above 10FFFF. Stopping at a non-trail byte is quite natural, and reading the
> PRI text accordingly is quite natural too.

I don't doubt that other people have written code with the same
concept as ICU, but as far as non-shortest-form handling goes, among
the implementations I tested (see the URL at the start of this email)
ICU is the lone outlier.
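
To make the non-shortest-form difference concrete with the
decode_lossy sketch from earlier in this email: C0 can never begin a
valid sequence, so under the old guidance the maximal subpart is C0
alone and the AF that follows is a second, separate error.

    // C0 AF is an over-long (non-shortest-form) encoding of '/'.
    // Maximal-subpart handling yields two REPLACEMENT CHARACTERs:
    assert_eq!(decode_lossy(b"\xC0\xAF"), "\u{FFFD}\u{FFFD}");
    // An ICU-style decoder that first collects trail bytes structurally
    // and only then range-checks the result yields a single U+FFFD.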

> Aside from UTF-8 history, there is a reason for preferring a more
> "structural" definition for UTF-8 over one purely along valid sequences.
> This applies to code that *works* on UTF-8 strings rather than just
> converting them. For UTF-8 *processing* you need to be able to iterate both
> forward and backward, and sometimes you need not collect code points while
> skipping over n units in either direction -- but your iteration needs to be
> consistent in all cases. This is easier to implement (especially in fast,
> short, inline code) if you have to look only at how many trail bytes follow
> a lead byte, without having to look whether the first trail byte is in a
> certain range for some specific lead bytes.

But the matter at hand is decoding potentially-invalid UTF-8 input
into a valid in-memory Unicode representation, so later processing is
somewhat of a red herring: it is out of scope for this step. I do
agree that if you already know that the data is valid UTF-8, it makes
sense to work from the bit-pattern definition alone. (E.g. in
encoding_rs, the implementation I've written that is on track to
replace uconv in Firefox, UTF-8 decode works using the knowledge of
which bytes can possibly follow which lead bytes, but encode from
UTF-8 to legacy encodings works using the bit-pattern definition,
because the Rust type system allows the encoder side to confidently
assume that its input is valid UTF-8.)
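
As a sketch of that agreement (the helper is hypothetical, not from
encoding_rs): on known-valid UTF-8, backward iteration needs only the
bit-pattern fact that every trail byte matches 0b10xxxxxx.

    /// Given a byte index at a scalar value boundary in known-valid
    /// UTF-8, returns the index where the previous scalar value starts.
    fn prev_char_start(s: &str, mut i: usize) -> usize {
        debug_assert!(i > 0 && s.is_char_boundary(i));
        let bytes = s.as_bytes();
        i -= 1;
        // Trail bytes are 0b10xxxxxx; on valid input we never need to
        // know which trail ranges a particular lead byte permits.
        while bytes[i] & 0xC0 == 0x80 {
            i -= 1;
        }
        i
    }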

On Sat, May 27, 2017 at 12:15 AM, Karl Williamson via Unicode
<unicode at unicode.org> wrote:
> The reason this discussion got started was that in December, someone came to
> me and said the code I support does not follow Unicode best practices, and
> suggested I need to change, though no ticket (yet) has been filed.

I think it's pretty uncool to inflict the problem you experienced on
everyone who followed the previous guidance instead.

>  I was
> surprised, and posted a query to this list about what the advantages of the
> new approach are.  There were a number of replies, but I did not see
> anything that seemed definitive.  After a month, I created a ticket in
> Unicode and Markus was assigned to research it, and came up with the
> proposal currently being debated.

I think the research I linked to at the start of this email shows that
the proposal wasn't researched sufficiently before it was brought to
the Unicode Technical Committee. If anything, I hope this thread
results in a requirement that proposals to change text about
implementation behavior come with proper research into what multiple
prominent implementations actually do about the subject matter of the
proposal.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


