Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Mark Davis ☕️ via Unicode unicode at unicode.org
Thu Aug 3 19:34:15 CDT 2017


FYI, the UTC retracted the following.

*[151-C19 <http://www.unicode.org/cgi-bin/GetL2Ref.pl?151-C19>] Consensus:*
Modify the section on "Best Practices for Using FFFD" in section "3.9 Encoding
Forms" of TUS per the recommendation in L2/17-168
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/17-168>, for Unicode
version 11.0.

Mark

(https://twitter.com/mark_e_davis)

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson via Unicode <
unicode at unicode.org> wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>>
>>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>>
>>>> Adding a "recommendation" this late in the game is just bad standards
>>>> policy.
>>>>
>>> Unless I misunderstand, you are missing the point.  There is already a
>>> recommendation listed in TUS,
>>>
>>
>> That's indeed correct.
>>
>>
>>> and that recommendation appears to have
>>> been added without much thought.
>>>
>>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in wide
>> use (among others, in major programming languages and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>
> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>
> And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>
> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>
> But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>
> It appears to me that little thought was given to the fact that these
> changes would cause overlongs to now be at least two units instead of one,
> making long-existing code no longer be best practice.  You are effectively
> saying I'm wrong about this.  I thought I had been paying attention to
> PRIs since the 5.x series, and I don't remember anything about this.  If
> you have evidence to the contrary, please give it. However, I would have
> thought Markus would have dug any up and given it in his proposal.
>
>
>
>>
>>> There is no proposal to add a
>>> recommendation "this late in the game".
>>>
>>
>> True. The proposal isn't for an addition, it's for a change. The "late in
>> the game" however, still applies.
>>
>> Regards,   Martin.
>>
>>
>
>
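
The two readings of "maximal subpart" discussed in the quoted thread can be
compared on the overlong sequence C0 80. The sketch below is illustrative
only. The function names are hypothetical and do not come from any poster's
code. It assumes CPython 3, whose built-in decoder with errors="replace"
emits one U+FFFD per rejected byte here, and it contrasts that with a decoder
that treats a lead byte plus the continuation bytes it announces as one unit.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Sequence length implied by the lead byte's bit pattern (0 if not a lead byte)."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation byte or byte that can never start a sequence


def decode_whole_sequence(data: bytes) -> str:
    """Hypothetical legacy-style decoder: an ill-formed sequence, overlongs
    included, is consumed as a single unit and becomes a single U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        # Collect up to n - 1 continuation bytes (0x80..0xBF) after the lead byte.
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict: rejects overlongs, surrogates
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


overlong = b"A\xc0\x80B"

# Built-in decoder: C0 and 80 are each replaced, giving two U+FFFD.
print(ascii(overlong.decode("utf-8", errors="replace")))  # 'A\ufffd\ufffdB'

# Whole-sequence reading: C0 80 is one unit, giving one U+FFFD.
print(ascii(decode_whole_sequence(overlong)))  # 'A\ufffdB'
```

Both decoders agree on well-formed input. They differ only in how many
U+FFFD characters an ill-formed sequence produces, which is the point under
dispute in the thread.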