Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode unicode at unicode.org
Thu Jun 1 12:41:45 CDT 2017


I think that the (or a) key problem is that the current "best practice" is treated as a "SHOULD" in RFC parlance, when what it really needs to be is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing, so when an implementation deviates, you get bugs (as we see here).  Given that there are very valid engineering reasons why someone might choose a different behavior for their needs, in most cases without harming the intent of the standard at all, I think the current/proposed language is too "strong".

-Shawn

-----Original Message-----
From: Alastair Houghton [mailto:alastair at alastairs-place.net] 
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivonen at hsivonen.fi>
Cc: unicode Unicode Discussion <unicode at unicode.org>; Shawn Steele <Shawn.Steele at microsoft.com>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> 
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode 
> <unicode at unicode.org> wrote:
>> * As far as I can tell, there are two (maybe three) sane approaches to this problem:
>>        * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence
>>        * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid.  In that case just use one U+FFFD.
>>        * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred garbage bytes as long as there weren't any valid sequences within that group).
> 
> I think it's not useful to come up with new rules in the abstract.

None of these is really a “new” rule: the first two are, respectively, the current “Best Practice” and the proposed “Best Practice”, and the third is another potentially reasonable approach that might make sense e.g. if the problem you’re worrying about is serial data slip or corruption of a compressed or encrypted stream (where corruption will continue until re-synchronisation happens, and as a result you wouldn’t expect to have any knowledge whatsoever of the number of characters represented in the corrupted data).

All of these approaches are explicitly allowed by the standard at present.  All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be).  In a general purpose library I’d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third.  I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach.
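To make the difference concrete, here’s a rough sketch in Python of how the three counting policies diverge on the same input.  It is deliberately *not* a conformant UTF-8 decoder (it checks only the lead/trail structure and ignores overlong forms, surrogates and the U+10FFFF ceiling); the byte values, function names and policy labels are purely illustrative.

REPLACEMENT = "\uFFFD"

def expected_trail(lead):
    """How many continuation bytes the lead byte announces, or None if it can't lead."""
    if lead < 0x80:
        return 0                     # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return None                      # stray continuation byte or invalid lead

def decode(data, policy):
    """policy is 'per_byte', 'per_sequence' or 'per_run'."""
    out = []
    i = 0
    in_garbage_run = False
    while i < len(data):
        n = expected_trail(data[i])
        well_formed = (
            n is not None
            and i + n < len(data)
            and all(0x80 <= b <= 0xBF for b in data[i + 1:i + 1 + n])
        )
        if well_formed:
            # (a structurally complete but overlong/surrogate sequence would
            # raise here; this sketch does not try to handle that case)
            out.append(bytes(data[i:i + 1 + n]).decode("utf-8"))
            i += 1 + n
            in_garbage_run = False
            continue
        # Ill-formed at position i: apply the chosen counting policy.
        if policy == "per_byte":
            out.append(REPLACEMENT)          # one U+FFFD per garbage byte
            i += 1
        elif policy == "per_sequence":
            out.append(REPLACEMENT)          # trust the lead byte's announced length
            i = min(i + 1 + (n or 0), len(data))
        else:                                # "per_run"
            if not in_garbage_run:
                out.append(REPLACEMENT)      # one U+FFFD per run of garbage
            i += 1
        in_garbage_run = True
    return "".join(out)

sample = b"A\xE1\x80\xC0\xC0A"   # a run of ill-formed bytes between two ASCII letters
for policy in ("per_byte", "per_sequence", "per_run"):
    print(policy, repr(decode(sample, policy)))

For that input the three policies emit four, two and one U+FFFD respectively; all of them preserve the surrounding ASCII, and they disagree only about how many replacement characters the garbage in the middle is worth.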

I don’t think it makes sense to standardise on *one* of these approaches, so if what you’re saying is that the “Best Practice” has been treated as if it were part of the specification (and I think that *is* essentially your claim), then I’m in favour of either removing it completely or (better) replacing it with Shawn’s suggestion, i.e. listing the three reasonable approaches and telling developers to document which one they take and why.

Kind regards,

Alastair.

--
http://alastairs-place.net



