Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Asmus Freytag via Unicode unicode at unicode.org
Tue May 16 02:22:53 CDT 2017


On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
> <unicode at unicode.org> wrote:
>> I’m not sure how the discussion of “which is better” relates to the
>> discussion of ill-formed UTF-8 at all.
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraints permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
There are cases where it is prohibitive to transcode external data from 
UTF-8 to any other format, as a precondition to doing any work. In these 
situations processing has to be done in UTF-8, effectively making that 
the in-memory representation. I've encountered this issue on separate 
occasions, both in my own code and in code I reviewed for clients.
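
To make that concrete, here is a minimal sketch (mine, in Rust; the 
function name and the counting task are purely illustrative, not anything 
from this thread) of processing text directly in its UTF-8 form: the 
bytes are validated once in place and then iterated as code points, with 
no transcoding to UTF-16 or UTF-32 at any stage.

    // Sketch only: a hypothetical task that works on the UTF-8 bytes in place.
    fn count_uppercase(input: &[u8]) -> Result<usize, std::str::Utf8Error> {
        // Validate once, in place; no second buffer, no transcoding step.
        let text: &str = std::str::from_utf8(input)?;
        // chars() decodes scalar values on the fly from the UTF-8 bytes.
        Ok(text.chars().filter(|c| c.is_uppercase()).count())
    }

    fn main() {
        let bytes = "Grüße, Welt".as_bytes();
        println!("{}", count_uppercase(bytes).unwrap()); // prints 2
    }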

I therefore think that Henri has a point when he is concerned about tacit 
assumptions favoring one memory representation over another, but the way 
he raises this point is needlessly antagonistic.
> ... At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to other 
modifications to the standard, but the larger point is not to make 
assumptions about how people implement the standard (or any of its 
algorithms).
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
I would like to second this as well.

The level of documented review of existing implementation practices 
tends to be thin (at least thinner than should be required for changing 
long-established edge cases or recommendations, let alone core 
conformance requirements).
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
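
To illustrate the behaviour Henri describes (with made-up bytes, not the 
exact sequences on his test page): under the current recommendation, a 
lead byte that can never start a well-formed sequence and a stray 
continuation byte each produce their own U+FFFD. The sketch below uses 
Rust's String::from_utf8_lossy, which to my knowledge follows that 
recommendation; it is an illustration, not part of Henri's test.

    // Sketch only: 0xC0 0x80 is a non-shortest-form ("overlong") encoding of
    // U+0000. 0xC0 can never begin a well-formed sequence and 0x80 here is a
    // stray continuation byte, so the decoder emits one U+FFFD per bogus byte.
    fn main() {
        let bogus = [0xC0u8, 0x80, b'!'];
        let decoded = String::from_utf8_lossy(&bogus);
        assert_eq!(decoded, "\u{FFFD}\u{FFFD}!");
        println!("{}", decoded); // two replacement characters, then '!'
    }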
It would be good if the UTC could work out some minimal requirements for 
evaluating proposals for changes to properties and algorithms, much like 
the criteria for encoding new code points.
A./

