Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode unicode at
Thu Jun 1 13:53:36 CDT 2017

But those are IETF definitions.  They don’t have to mean the same thing in Unicode - except that people working in this field probably expect them to.

From: Unicode [mailto:unicode-bounces at] On Behalf Of Asmus Freytag via Unicode
Sent: Thursday, June 1, 2017 11:44 AM
To: unicode at
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:

I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance, when what this really needs is a "MAY".

People reading standards tend to treat "SHOULD" and "MUST" as the same thing.

It's not that they "tend to", it's in RFC 2119:

   This word, or the adjective "RECOMMENDED", mean that there
   may exist valid reasons in particular circumstances to ignore a
   particular item, but the full implications must be understood and
   carefully weighed before choosing a different course.

The clear inference is that while the non-recommended practice is not prohibited, you better have some valid reason why you are deviating from it (and, reading between the lines, it would not hurt if you documented those reasons).

So, when an implementation deviates, you get bugs (as we see here).  Given that there are very valid engineering reasons why someone might want to choose a different behavior for their needs - without harming the intent of the standard at all in most cases - I think the current/proposed language is too "strong".

Yes and no. ICU would be perfectly fine deviating from the existing recommendation and stating their engineering reasons for doing so. That would allow them to close their bug ("by documentation").

What's not OK is to take an existing recommendation and change it to something else, just to make bug reports go away for one implementation. That's like two sleepers fighting over a blanket that's too short: whenever one is covered, the other is exposed.

If it is discovered that the existing recommendation is not based on anything like truly better behavior, there may be a case to change it to something that's equivalent to a MAY. Perhaps a list of nearly equally capable options.

(If that language is not in the standard already, a strong "an implementation MUST NOT depend on the use of a particular strategy for replacement of invalid code sequences" clearly ought to be added.)



-----Original Message-----
From: Alastair Houghton [mailto:alastair at]
Sent: Thursday, June 1, 2017 4:05 AM
To: Henri Sivonen <hsivonen at>
Cc: unicode Unicode Discussion <unicode at>; Shawn Steele <Shawn.Steele at>
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode at> wrote:

On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode <unicode at> wrote:

* As far as I can tell, there are two (maybe three) sane approaches to this problem:
  * Either a "maximal" emission of one U+FFFD for every byte that exists outside of a good sequence
  * Or a "minimal" version that presumes the lead byte was counting trail bytes correctly even if the resulting sequence was invalid.  In that case just use one U+FFFD.
  * And (maybe, I haven't heard folks arguing for this one) emit one U+FFFD at the first garbage byte and then ignore the input until valid data starts showing up again.  (So you could have 1 U+FFFD for a string of a hundred garbage bytes, as long as there weren't any valid sequences within that group.)
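[To make the difference between the first two options concrete, here is a rough sketch in Python. The function names `decode_maximal`, `decode_minimal`, and the helper `expected_length` are invented for illustration, not part of any standard library; the validity checks simply reuse Python's strict built-in UTF-8 decoder rather than hand-rolled state tables.]

```python
# Hypothetical sketch of the two substitution policies discussed above;
# all names are invented for illustration.

REPLACEMENT = "\ufffd"

def expected_length(lead: int) -> int:
    """Trail-byte count implied by a UTF-8 lead byte (0 if it isn't one)."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return 0

def decode_maximal(data: bytes) -> str:
    """'Maximal' policy: one U+FFFD for every byte outside a good sequence."""
    out, i = [], 0
    while i < len(data):
        for k in (1, 2, 3, 4):           # try to read one well-formed sequence
            try:
                out.append(data[i:i + k].decode("utf-8"))
                i += k
                break
            except UnicodeDecodeError:
                continue
        else:                            # byte i joins no good sequence
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

def decode_minimal(data: bytes) -> str:
    """'Minimal' policy: trust the lead byte's trail count and emit a single
    U+FFFD for the whole claimed sequence if it turns out to be invalid."""
    out, i = [], 0
    while i < len(data):
        n = expected_length(data[i])
        try:
            out.append(data[i:i + n + 1].decode("utf-8"))
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i += n + 1                       # skip the whole claimed sequence
    return "".join(out)
```

[On the ill-formed input `b"\xe0\x80\x80A"`, the maximal sketch yields three U+FFFDs followed by "A", while the minimal sketch yields a single U+FFFD followed by "A".]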

I think it's not useful to come up with new rules in the abstract.

The first two aren’t “new” rules; they’re, respectively, the current “Best Practice” and the proposed “Best Practice”. The third is one other potentially reasonable approach that might make sense e.g. if the problem you’re worrying about is serial data slip or corruption of a compressed or encrypted file (where corruption will occur until re-synchronisation happens, and as a result you wouldn’t expect to have any knowledge whatever of the number of characters represented in the data in question).
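[That resynchronising behaviour, the third option, could be sketched roughly as follows; again the names are invented for illustration and the validity probe just reuses Python's strict built-in UTF-8 decoder.]

```python
# Hypothetical sketch of the resynchronising policy: one U+FFFD covers an
# entire run of garbage, and decoding resumes only once valid data reappears.

REPLACEMENT = "\ufffd"

def _starts_valid(data: bytes, i: int) -> bool:
    """True if a well-formed UTF-8 sequence begins at index i."""
    for k in (1, 2, 3, 4):
        try:
            data[i:i + k].decode("utf-8")
            return True
        except UnicodeDecodeError:
            continue
    return False

def decode_resync(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        if _starts_valid(data, i):
            for k in (1, 2, 3, 4):       # consume one well-formed sequence
                try:
                    out.append(data[i:i + k].decode("utf-8"))
                    i += k
                    break
                except UnicodeDecodeError:
                    continue
        else:
            out.append(REPLACEMENT)      # one marker for the whole bad run
            i += 1
            while i < len(data) and not _starts_valid(data, i):
                i += 1                   # skip until valid data reappears
    return "".join(out)
```

[Under this sketch, a hundred garbage bytes between two valid characters collapse to a single U+FFFD.]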

All of these approaches are explicitly allowed by the standard at present.  All three are reasonable, and each has its own pros and cons in a technical sense (leaving aside how prevalent the approach in question might be).  In a general purpose library I’d probably go for the second one; if I knew I was dealing with a potentially corrupt compressed or encrypted stream, I might well plump for the third.  I can even *imagine* there being circumstances under which I might choose the first for some reason, in spite of my preference for the second approach.

I don’t think it makes sense to standardise on *one* of these approaches, so if what you’re saying is that the “Best Practice” has been treated as if it was part of the specification (and I think that *is* essentially your claim), then I’m in favour of either removing it completely, or (better) replacing it with Shawn’s suggestion - i.e. listing three reasonable approaches and telling developers to document which they take and why.

Kind regards,


