Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Martin J. Dürst via Unicode
unicode at unicode.org
Tue May 30 06:26:39 CDT 2017
Hello Karl, others,
On 2017/05/27 06:15, Karl Williamson via Unicode wrote:
> On 05/26/2017 12:22 PM, Ken Whistler wrote:
>> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>>> The link provided about the PRI doesn't lead to the comments.
>> PRI #121 (August, 2008) pre-dated the practice of keeping all the
>> feedback comments together with the PRI itself in a numbered directory
>> with the name "feedback.html". But the comments were collected
>> together at the time and are accessible here:
>> Also there was a separately submitted comment document:
>> And the minutes of the pertinent UTC meeting (UTC #116):
>> The minutes simply capture the consensus to adopt Option #2 from PRI
>> #121, and the relevant action items.
>> I now return the floor to the distinguished disputants to continue
>> litigating history. ;-)
> The reason this discussion got started was that in December, someone
> came to me and said the code I support does not follow Unicode best
> practices, and suggested I need to change, though no ticket (yet) has
> been filed. I was surprised, and posted a query to this list about what
> the advantages of the new approach are.
Can you provide a reference to that discussion? I might have missed it
> There were a number of replies,
> but I did not see anything that seemed definitive. After a month, I
> created a ticket in Unicode and Markus was assigned to research it, and
> came up with the proposal currently being debated.
Which is to completely reverse the current recommendation in Unicode
9.0. While I agree that this might help you fend off a bug report, it
would create chances for bug reports for Ruby, Python3, many if not all
> Looking at the PRI, it seems to me that treating an overlong as a single
> maximal unit is in the spirit of the wording, if not the fine print.
In standards, the "fine print" matters.
> That seems to be borne out by Markus, even with his stake in ICU,
> supporting option #2.
Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I
also supported option 2, with code behind it.
> Looking at the comments, I don't see any discussion of the effect of
> this on overlong treatments. My guess is that the effect change was
I agree that it was probably not considered explicitly. But overlongs
were disallowed for security reasons, and once the definition of UTF-8
was tightened, "overlongs" essentially did not exist anymore.
In that sense, "overlong" is a word like "dragon" or "ghost": Everybody
knows what it means, but everybody knows they don't exist.
[Just to be sure, by the above, I don't mean that a sequence such as
C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by
itself, and there is no need to see C0 B0 as a (ghost) sequence.]
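
To make the C0 B0 case concrete, here is a small sketch using Python 3's built-in decoder, which is one of the implementations mentioned in this thread as following the current recommendation. The helper name replace_ill_formed is made up for illustration, and the behavior shown assumes a reasonably recent CPython 3 (3.3 or later).

```python
# Assumption: CPython 3.3 or later, whose UTF-8 decoder replaces each
# "maximal subpart" of an ill-formed sequence with one U+FFFD.
def replace_ill_formed(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for ill-formed input (hypothetical helper)."""
    return data.decode("utf-8", errors="replace")

# C0 can never start a well-formed UTF-8 sequence, so it is replaced on
# its own; the stray continuation byte B0 is then replaced separately.
overlong = replace_ill_formed(b"\xc0\xb0")

# E2 82 is a valid prefix (of E2 82 AC, for example) cut short by "A",
# so the two bytes together are replaced once.
truncated = replace_ill_formed(b"\xe2\x82A")

print([f"U+{ord(c):04X}" for c in overlong])
print([f"U+{ord(c):04X}" for c in truncated])
```

Under that assumption, the first call should yield two U+FFFD (there is no "ghost" two-byte sequence to replace as a unit), while the second yields a single U+FFFD followed by "A".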
> So I have code that handled overlongs in the only correct way possible
> when they were acceptable,
No. As long as they were acceptable, they wouldn't have been replaced by
> and in the obvious way after they became illegal,
Why? A change was necessary from producing an actual character to
producing some number of FFFDs. It may have been easier to produce just
a single FFFD, but that depends on how the code was organized.
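
How the number of FFFDs falls out of the code organization can be sketched as follows. This is not the code of any implementation discussed here; the function names and the strict flag are made up for illustration. With strict=True the loop only accepts bytes that can continue a well-formed sequence (the current recommendation); with strict=False it collects a lead byte plus its trail bytes by bit pattern first and validates afterwards, which is one possible reading of "treating an overlong as a single unit".

```python
def expected_length(lead: int, strict: bool) -> int:
    """Sequence length implied by a lead byte, or 0 if it is not a lead byte."""
    if strict:
        # Only lead bytes that can begin well-formed UTF-8 (no C0, C1, F5..FF).
        ranges = [(0xC2, 0xDF, 2), (0xE0, 0xEF, 3), (0xF0, 0xF4, 4)]
    else:
        # Pure bit-pattern view, as in the original (pre-tightening) UTF-8.
        ranges = [(0xC0, 0xDF, 2), (0xE0, 0xEF, 3), (0xF0, 0xF7, 4)]
    for lo, hi, n in ranges:
        if lo <= lead <= hi:
            return n
    return 0


def second_byte_range(lead: int, strict: bool) -> tuple:
    """Allowed range for the byte right after the lead byte."""
    if strict:
        # Restricted second bytes exclude overlongs, surrogates and > U+10FFFF.
        special = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
                   0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}
        return special.get(lead, (0x80, 0xBF))
    return (0x80, 0xBF)


def decode_with_policy(data: bytes, strict: bool = True) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # ASCII passes through
            out.append(chr(b))
            i += 1
            continue
        n = expected_length(b, strict)
        if n == 0:                        # byte cannot start a sequence
            out.append("\ufffd")
            i += 1
            continue
        j = i + 1
        lo, hi = second_byte_range(b, strict)
        while j < i + n and j < len(data) and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF           # later trail bytes: plain range
        try:
            # Strict decode of the collected bytes; fails if they are
            # truncated or (in the non-strict policy) overlong etc.
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            out.append("\ufffd")          # one FFFD per collected unit
        i = j
    return "".join(out)


two = decode_with_policy(b"\xc0\xb0", strict=True)
one = decode_with_policy(b"\xc0\xb0", strict=False)
print(len(two), len(one))
```

In this sketch the two policies differ only in the two small lookup functions, so neither is inherently harder to write; which one is "easier" depends on whether an existing decoder validates while it collects bytes or afterwards.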
> and now without apparent discussion (which is very much akin to
> "flimsy reasons"), it suddenly was no longer "best practice".
Not 'now', but almost 9 years ago. And not "without apparent
discussion", but with an explicit PRI.
> And that
> change came "rather late in the game". That this escaped notice for
> years indicates that the specifics of REPLACEMENT CHAR handling don't
> matter all that much.
I agree. You haven't even received a ticket yet.
> To cut to the chase, I think Unicode should issue a Corrigendum to the
> effect that it was never the intent of this change to say that treating
> overlongs as a single unit isn't best practice. I'm not sure this
> warrants a full-fledge Corrigendum, though. But I believe the text of
> the best practices should indicate that treating overlongs as a single
> unit is just as acceptable as Martin's interpretation.
I'd essentially be fine with that, under the condition that the current
recommendation is maintained as a clearly identified recommendation, so
that Python3, Ruby, Web standards and browsers, and so on can easily
refer to it.
> I believe this is pretty much in line with Shawn's position. Certainly,
> a discussion of the reasons one might choose one interpretation over
> another should be included in TUS. That would likely have satisfied my
> original query, which hence would never have been posted.