Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Karl Williamson via Unicode unicode at
Wed May 24 17:56:39 CDT 2017

On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>> Adding a "recommendation" this late in the game is just bad standards
>>> policy.
>> Unless I misunderstand, you are missing the point.  There is already a
>> recommendation listed in TUS,
> That's indeed correct.
>> and that recommendation appears to have
>> been added without much thought.
> That's wrong. There was a public review issue with various options and
> with feedback, and the recommendation has been implemented and in wide
> use (among others, in major programming languages and browsers) without
> problems for quite some time.

Could you supply a reference to the PRI and its feedback?

The recommendation in TUS 5.2 is "Replace each maximal subpart of an 
ill-formed subsequence by a single U+FFFD."

And I agree with that.  And I view an overlong sequence as a maximal 
ill-formed subsequence that should be replaced by a single FFFD. 
There's nothing in the text of 5.2 that immediately follows that 
recommendation that indicates to me that my view is incorrect.

Perhaps my view is colored by the fact that I now maintain code that was 
written to parse UTF-8 back when overlongs were still considered legal 
input.  An overlong was a single unit.  When they became illegal, the 
code still considered them a single unit.
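
A minimal sketch of that "single unit" reading, in Python. It is an
illustration under stated assumptions, not the code referred to above:
the name decode_structural is hypothetical, lead bytes F8..FF are simply
treated as invalid (older decoders also accepted 5- and 6-byte forms),
and the length of a sequence is taken from the lead byte alone. The
overlong check happens only after the whole unit has been read, so an
overlong yields exactly one U+FFFD.

```python
REPLACEMENT = "\ufffd"

def decode_structural(data: bytes) -> str:
    """Decode UTF-8, treating a lead byte plus the continuation bytes it
    announces as one unit. An ill-formed unit becomes one U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        else:                             # stray continuation or F8..FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume up to `need` continuation bytes (80..BF).
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i - 1) == need
        valid = (complete
                 and minimum <= cp <= 0x10FFFF      # rejects overlongs
                 and not 0xD800 <= cp <= 0xDFFF)    # rejects surrogates
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch, decode_structural(b"\xc0\xaf") returns a single U+FFFD,
because C0 AF is read as one (overlong) two-byte unit.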

I can understand how someone who comes along later could say that C0
can't be followed by any continuation byte without yielding an overlong,
and therefore C0 by itself is a maximal subpart.
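
A minimal sketch of that other reading, in Python, for comparison. It
assumes the byte ranges of the well-formed UTF-8 byte sequences table in
TUS; the names trailing_rule and decode_maximal_subpart are hypothetical.
Here a lead byte is only accepted if it can start some well-formed
sequence, so C0 is rejected on its own and the byte after it is examined
separately.

```python
REPLACEMENT = "\ufffd"

def trailing_rule(lead: int):
    """Return (trailing byte count, low, high) where low..high is the range
    allowed for the SECOND byte, or None if `lead` can never start a
    well-formed sequence (80..C1 and F5..FF)."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF          # excludes 3-byte overlongs
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xED:
        return 2, 0x80, 0x9F          # excludes surrogates
    if lead == 0xF0:
        return 3, 0x90, 0xBF          # excludes 4-byte overlongs
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F          # excludes values above U+10FFFF
    return None

def decode_maximal_subpart(data: bytes) -> str:
    """Replace each longest prefix of a well-formed sequence (or each
    single byte that cannot start one) with one U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        rule = trailing_rule(b)
        if rule is None:              # e.g. C0, C1, or a stray 80..BF
            out.append(REPLACEMENT)
            i += 1
            continue
        need, lo, hi = rule
        j = i + 1
        for _ in range(need):
            if j < n and lo <= data[j] <= hi:
                j += 1
                lo, hi = 0x80, 0xBF   # later bytes use the plain range
            else:
                break
        if j - i - 1 == need:
            out.append(data[i:j].decode("utf-8"))
        else:
            out.append(REPLACEMENT)   # truncated prefix: one U+FFFD
        i = j
    return "".join(out)
```

With this sketch, decode_maximal_subpart(b"\xc0\xaf") returns two U+FFFD:
one for C0, and one for AF, which is then a stray continuation byte.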

But I assert that my interpretation is just as valid as that one.  And 
perhaps more so, because of historical precedent.

It appears to me that little thought was given to the fact that these
changes would cause overlongs to now be at least two units instead of
one, so that long-standing code would no longer be best practice.  You
are effectively saying I'm wrong about this.  I thought I had been paying
attention to PRIs since the 5.x series, and I don't remember anything
about this.  If you have evidence to the contrary, please give it.
However, I would have thought Markus would have dug up any such evidence
and cited it in his proposal.

>> There is no proposal to add a
>> recommendation "this late in the game".
> True. The proposal isn't for an addition, it's for a change. The "late
> in the game", however, still applies.
> Regards,   Martin.
