Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen via Unicode unicode at
Wed May 17 01:03:49 CDT 2017

On Tue, May 16, 2017 at 9:36 PM, Markus Scherer < at> wrote:
> Let me try to address some of the issues raised here.

Thank you.

> The proposal changes a recommendation, not a requirement.

This is a very bad reason in favor of the change. If anything, this
should be a reason why there is no need to change the spec text.

> Conformance
> applies to finding and interpreting valid sequences properly. This includes
> not consuming parts of valid sequences when dealing with illegal ones, as
> explained in the section "Constraints on Conversion Processes".
> Otherwise, what you do with illegal sequences is a matter of what you think
> makes sense -- a matter of opinion and convenience. Nothing more.

This may be the Unicode-level view of error handling. It isn't the
Web-level view of error handling. In the world of Web standards (i.e.
standards that read on the behavior of browser engines), we've
learned that implementation-defined behavior is bad, because someone
makes a popular site that depends on the implementation-defined
behavior of the browser they happened to test in. For this reason, the
WHATWG has since 2004 written specs that are well-defined even in
corner cases and for non-conforming input, and we've tried to extend
this culture into the W3C, too. (Sometimes, exceptions are made when
there's a very good reason to handle a corner case differently in a
given implementation: A recent example is CSS allowing the
non-preservation of lone surrogates entering the CSS Object Model via
JavaScript strings in order to enable CSS Object Model implementations
that use UTF-8 [really UTF-8 and not some almost-UTF-8 variant]
internally. But, yes, we really do sweat the details on that level.)

Even if one could argue that implementation-defined behavior on the
topic of number of U+FFFDs for ill-formed sequences in UTF-8 decode
doesn't matter, the WHATWG way of doing things isn't to debate whether
implementation-defined behavior matters in this particular case but to
require one particular behavior in order to have well-defined behavior
even when input is non-conforming.

It further seems that there are people who do care about what's a
*requirement* on the WHATWG level matching what's "best practice" on
the Unicode level:

Now that major browsers agree, knowing what I know about how the
WHATWG operates, while I can't speak for Anne, I expect the WHATWG
spec to stay as-is, because it now matches the browser consensus.

So as a practical matter, if Unicode now changes its "best practice",
when people check consistency with Unicode-level "best practice" and
notice a discrepancy, the WHATWG and developers of implementations
that took the previously-stated "best practice" seriously (either
directly or by the means of another spec, like the WHATWG Encoding
Standard, elevating it to a *requirement*) will need to explain why
they don't follow the best practice.

It is really inappropriate to inflict that trouble onto pretty much
everyone except ICU when the rationale for change is as flimsy as
"feels right". And, as noted earlier, politically it looks *really
bad* for Unicode to change its own previous recommendation to side
with ICU not following it when a number of other prominent
implementations do.

> I believe that the discussion of how to handle illegal sequences came out of
> security issues a few years ago from some implementations including valid
> single and lead bytes with preceding illegal sequences.
> Why do we care how we carve up an illegal sequence into subsequences? Only
> for debugging and visual inspection.
> If you don't like some recommendation, then do something else. It does not
> matter. If you don't reject the whole input but instead choose to replace
> illegal sequences with something, then make sure the something is not
> nothing -- replacing with an empty string can cause security issues.
> Otherwise, what the something is, or how many of them you put in, is not
> very relevant. One or more U+FFFDs is customary.

When the recommendation came about for security reasons, it's a really
bad idea to suggest that implementors should decide on their own
what to do and trust that their decision deviates little enough from
the suggestion to stay on the secure side. To be clear, I'm not, at
this time, claiming that the number of U+FFFDs has a security
consequence as long as the number is at least one, but there's an
awfully short slippery slope to giving the caller of a converter API
the option to "ignore errors", i.e. make the number zero, which *is*,
as you note, a security problem.
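
To make the "number zero" case concrete, here is a minimal sketch using
Python's built-in UTF-8 decoder as a stand-in for any converter API that
offers an "ignore errors" option. The input bytes and the idea of a
filter scanning for "<script>" are assumptions made up for
illustration; they are not taken from any real incident.

```python
# Hypothetical input: 0xC0 never occurs in well-formed UTF-8, so a filter
# scanning the raw bytes for b"<script>" would not match this.
raw = b"<scr\xc0ipt>"

# "replace" substitutes U+FFFD for the ill-formed byte, so the tag name
# stays broken after decoding.
replaced = raw.decode("utf-8", errors="replace")

# "ignore" emits nothing for the ill-formed byte (zero U+FFFDs), so the
# surrounding text is spliced together into a working tag.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

With at least one U+FFFD the decoded text still differs from
"<script>"; with zero it does not, which is the problem.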

> When the current recommendation came in, I thought it was reasonable but
> didn't like the edge cases. At the time, I didn't think it was important to
> twiddle with the text in the standard, and I didn't care that ICU didn't
> exactly implement that particular recommendation.

If ICU doesn't care, then it should be ICU developers and not the
developers of other implementations who respond to bug reports about
not following the "best practice".

> Karl Williamson sent feedback to the UTC, "In short, I believe the best
> practices are wrong." I think "wrong" is far too strong, but I got an action
> item to propose a change in the text. I proposed a modified recommendation.
> Nothing gets elevated to "right" that wasn't, nothing gets demoted to
> "wrong" that was "right".

I find it shocking that the Unicode Consortium would change a
widely-implemented part of the standard (regardless of whether Unicode
itself officially designates it as a requirement or suggestion) on
such flimsy grounds.

I'd like to register my feedback that I believe changing the best
practices is wrong.

> no one is forced to do something they don't like

I don't believe this to be *practically* true when
 1) other specs elevate into requirements what are mere suggestions on
the Unicode level
 2) people who read specs carefully file bugs for discrepancies
between implementations and best practice
 3) test suites will test things a particular way and the easy way for
test suite authors to settle arguments is to let the "best practice"

Henri Sivonen
hsivonen at
