Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Shawn Steele via Unicode
unicode at unicode.org
Fri May 26 16:41:49 CDT 2017
So basically this came about because a bug was reported against code for not following the "recommendation." To fix that, the recommendation will be changed. However, that will then lead to bug reports against other existing code that does not follow the new recommendation.
I totally get the reasoning, for some implementations, about keeping forward and backward scanning in sync without decoding; however, I do not think that practices that benefit those implementations should extend to other applications that are happy with a different practice.
In either case, the bad characters are garbage, so neither approach is "better" - except that one or the other may be more conducive to the requirements of the particular API/application.
I really think the correct approach here is to allow any number of replacement characters without prejudice. Perhaps with suggestions for pros and cons of various approaches if people feel that is really necessary.
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode
Sent: Friday, May 26, 2017 2:16 PM
To: Ken Whistler <kenwhistler at att.net>
Cc: unicode at unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 05/26/2017 12:22 PM, Ken Whistler wrote:
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
>> The link provided about the PRI doesn't lead to the comments.
> PRI #121 (August, 2008) pre-dated the practice of keeping all the
> feedback comments together with the PRI itself in a numbered directory
> with the name "feedback.html". But the comments were collected
> together at the time and are accessible here:
> Also there was a separately submitted comment document:
> And the minutes of the pertinent UTC meeting (UTC #116):
> The minutes simply capture the consensus to adopt Option #2 from PRI
> #121, and the relevant action items.
> I now return the floor to the distinguished disputants to continue
> litigating history. ;-)
The reason this discussion got started was that in December, someone came to me and said the code I support does not follow Unicode best practices, and suggested I need to change, though no ticket (yet) has been filed. I was surprised, and posted a query to this list about what the advantages of the new approach are. There were a number of replies, but I did not see anything that seemed definitive. After a month, I created a ticket in Unicode and Markus was assigned to research it, and came up with the proposal currently being debated.
Looking at the PRI, it seems to me that treating an overlong as a single maximal unit is in the spirit of the wording, if not the fine print.
That seems to be borne out by Markus, even with his stake in ICU, supporting option #2.
Looking at the comments, I don't see any discussion of the effect of this on overlong treatments. My guess is that the effect of the change on overlongs was unintentional.
So I have code that handled overlongs in the only correct way possible when they were acceptable, and in the obvious way after they became illegal, and now without apparent discussion (which is very much akin to "flimsy reasons"), it suddenly was no longer "best practice". And that change came "rather late in the game". That this escaped notice for years indicates that the specifics of REPLACEMENT CHAR handling don't matter all that much.
To cut to the chase, I think Unicode should issue a Corrigendum to the effect that it was never the intent of this change to say that treating overlongs as a single unit isn't best practice. I'm not sure this warrants a full-fledged Corrigendum, though. But I believe the text of the best practices should indicate that treating overlongs as a single unit is just as acceptable as Martin's interpretation.
I believe this is pretty much in line with Shawn's position. Certainly, a discussion of the reasons one might choose one interpretation over another should be included in TUS. That would likely have satisfied my original query, which hence would never have been posted.
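To make the two interpretations concrete, here is a minimal Python sketch. It is not taken from any implementation mentioned in this thread. The function name decode_overlong_as_unit is hypothetical, and its rule (one U+FFFD per sequence as delimited by the lead byte's nominal length) is only one possible reading of "treating an overlong as a single unit." For comparison it uses CPython's built-in decoder with errors="replace", which, as far as I can tell, emits one U+FFFD per maximal subpart.

```python
REPLACEMENT = "\ufffd"


def decode_overlong_as_unit(data: bytes) -> str:
    """Hypothetical decoder: one U+FFFD per structurally delimited sequence.

    The lead byte announces a nominal length; the lead plus the continuation
    bytes that follow it (up to that length) are treated as one unit, even
    when the unit turns out to be overlong, a surrogate, or out of range.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            length, value, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, value, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            length, value, minimum = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation byte, or a byte that can never start a sequence.
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        while j < n and j < i + length and 0x80 <= data[j] <= 0xBF:
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i) == length
        valid = (
            complete
            and minimum <= value <= 0x10FFFF
            and not 0xD800 <= value <= 0xDFFF
        )
        out.append(chr(value) if valid else REPLACEMENT)
        i = j  # the whole unit is consumed, valid or not
    return "".join(out)


# E0 80 80 is an overlong three-byte encoding of U+0000.
overlong = b"A\xe0\x80\x80B"
as_unit = decode_overlong_as_unit(overlong)
per_subpart = overlong.decode("utf-8", errors="replace")
print(as_unit.count(REPLACEMENT), per_subpart.count(REPLACEMENT))
```

Both approaches agree on well-formed input and on a truncated but otherwise plausible sequence such as E2 82 followed by "A"; they differ only in how many U+FFFD they emit for sequences like the overlong above, which is the point both messages make about neither result being more "correct" than the other.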