Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 30 06:26:39 CDT 2017

Hello Karl, others,

On 2017/05/27 06:15, Karl Williamson via Unicode wrote:
> On 05/26/2017 12:22 PM, Ken Whistler wrote:
>>
>> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>>> The link provided about the PRI doesn't lead to the comments.
>>>
>>
>> PRI #121 (August, 2008) pre-dated the practice of keeping all the 
>> feedback comments together with the PRI itself in a numbered directory 
>> with the name "feedback.html". But the comments were collected 
>> together at the time and are accessible here:
>>
>> http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121
>>
>> Also there was a separately submitted comment document:
>>
>> http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt
>>
>> And the minutes of the pertinent UTC meeting (UTC #116):
>>
>> http://www.unicode.org/L2/L2008/08253.htm
>>
>> The minutes simply capture the consensus to adopt Option #2 from PRI 
>> #121, and the relevant action items.
>>
>> I now return the floor to the distinguished disputants to continue 
>> litigating history. ;-)
>>
>> --Ken
>>
>>
> 
> The reason this discussion got started was that in December, someone 
> came to me and said the code I support does not follow Unicode best 
> practices, and suggested I need to change, though no ticket (yet) has 
> been filed.  I was surprised, and posted a query to this list about what 
> the advantages of the new approach are.

Can you provide a reference to that discussion? I might have missed it 
in December.

> There were a number of replies, 
> but I did not see anything that seemed definitive.  After a month, I 
> created a ticket in Unicode and Markus was assigned to research it, and 
> came up with the proposal currently being debated.

Which is to completely reverse the current recommendation in Unicode 
9.0. While I agree that this might help you fend off a bug report, it 
would open the door to bug reports against Ruby, Python3, many if not 
all Web browsers,...

> Looking at the PRI, it seems to me that treating an overlong as a single 
> maximal unit is in the spirit of the wording, if not the fine print.

In standards, the "fine print" matters.

> That seems to be borne out by Markus, even with his stake in ICU, 
> supporting option #2.

Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I 
also supported option 2, with code behind it.

> Looking at the comments, I don't see any discussion of the effect of 
> this on overlong treatments.  My guess is that the effect change was 
> unintentional.

I agree that it was probably not considered explicitly. But overlongs 
were disallowed for security reasons, and once the definition of UTF-8 
was tightened, "overlongs" essentially did not exist anymore. 
In that sense, "overlong" is a word like "dragon" or "ghost": everybody 
knows what it means, but everybody knows they don't exist.

[Just to be sure, by the above, I don't mean that a sequence such as
C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by 
itself, and there is no need to see C0 B0 as a (ghost) sequence.]
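
[As an illustration only, here is a minimal sketch of how a decoder that 
follows the current recommendation treats such input. It assumes 
CPython 3's built-in UTF-8 codec with errors="replace"; the helper name 
replace_decode and the sample byte strings are hypothetical, chosen for 
this example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_decode(data: bytes) -> str:
    """Decode with Python's built-in UTF-8 codec, replacing ill-formed bytes."""
    return data.decode("utf-8", errors="replace")


samples = (
    b"\xc0\xb0",      # C0 can never start a well-formed sequence
    b"\xe0\x80\x80",  # E0 must be followed by A0..BF, so E0 stands alone
    b"\xe2\x82A",     # truncated three-byte sequence followed by ASCII "A"
)

for sample in samples:
    decoded = replace_decode(sample)
    # Report how many replacement characters each input produced.
    print(sample.hex(), "->", decoded.count(REPLACEMENT), "x U+FFFD")
```

With this codec, C0 and B0 are each replaced separately, while the 
truncated prefix E2 82 is replaced as one unit, because it is the start 
of a well-formed sequence and C0 is not.]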

> So I have code that handled overlongs in the only correct way possible 
> when they were acceptable,

No. As long as they were acceptable, they wouldn't have been replaced by 
an FFFD.

> and in the obvious way after they became illegal,

Why? A change was necessary from producing an actual character to 
producing some number of FFFDs. It may have been easier to produce just 
a single FFFD, but that depends on how the code was organized.
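
To make the "depends on how the code was organized" point concrete, 
here is a rough sketch of a decoder organized around the length implied 
by the lead byte, which naturally ends up producing one FFFD per 
overlong. It is an assumption-laden illustration, not taken from any 
implementation discussed in this thread; the function name 
decode_single_unit is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_single_unit(data: bytes) -> str:
    """Emit one U+FFFD per lead byte plus the continuation bytes it claims."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            out.append(chr(b))  # ASCII passes through unchanged
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1  # lead byte pattern claims one continuation byte
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:
            # Lone continuation byte (80..BF) or F8..FF: replace it alone.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Collect at most `need` continuation bytes after the lead byte.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict: rejects overlongs
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # whole chunk becomes a single U+FFFD
        i = j
    return "".join(out)


for sample in (b"\xc0\xb0", b"\xe0\x80\x80"):
    print(sample.hex(),
          "single-unit:", decode_single_unit(sample).count(REPLACEMENT),
          "built-in:", sample.decode("utf-8", "replace").count(REPLACEMENT))
```

A decoder that instead validates byte by byte against the table of 
well-formed sequences stops at C0 or E0 immediately, so for it the 
"obvious" result is one FFFD per offending byte.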

> and now without apparent discussion (which is very much akin to 
> "flimsy reasons"), it suddenly was no longer "best practice".

Not 'now', but almost 9 years ago. And not "without apparent 
discussion", but with an explicit PRI.

> And that 
> change came "rather late in the game".  That this escaped notice for 
> years indicates that the specifics of REPLACEMENT CHAR handling don't 
> matter all that much.

I agree. You haven't even received a ticket yet.

> To cut to the chase, I think Unicode should issue a Corrigendum to the 
> effect that it was never the intent of this change to say that treating 
> overlongs as a single unit isn't best practice.  I'm not sure this 
> warrants a full-fledged Corrigendum, though.  But I believe the text of 
> the best practices should indicate that treating overlongs as a single 
> unit is just as acceptable as Martin's interpretation.

I'd essentially be fine with that, under the condition that the current 
recommendation is maintained as a clearly identified recommendation, so 
that Python3, Ruby, Web standards and browsers, and so on can easily 
refer to it.

Regards,   Martin.

> I believe this is pretty much in line with Shawn's position.  Certainly, 
> a discussion of the reasons one might choose one interpretation over 
> another should be included in TUS.  That would likely have satisfied my 
> original query, which hence would never have been posted.
> .
>