Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 01:50:54 CDT 2017

On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
<unicode at unicode.org> wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the
underlying issue. I'll clarify what I meant on that point and then
move on:

I acknowledge that UTF-16 as the internal memory representation is the
dominant design. However, because UTF-8 as the internal memory
representation is *such a good design* (when legacy constraits permit)
that *despite it not being the current dominant design*, I think the
Unicode Consortium should be fully supportive of UTF-8 as the internal
memory representation and not treat UTF-16 as the internal
representation as the one true way of doing things that gets
considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory
representation (for the purposes of this thread) but trying to
motivate why the Consortium should consider "UTF-8 internally" equally
despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally"
perspective, but one way clearly makes more sense from the "UTF-8
internally" perspective, the "UTF-8 internally" perspective should be
decisive in *such a case*. (I think the matter at hand is such a
case.)

At the very least a proposal should discuss the impact on the "UTF-8
internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of
implementations) issue to be decided but a proposed change to a
standard that has many widely-deployed implementations. Even when
observing only "UTF-16 internally" implementations, I think it would
be appropriate for the proposal to include a review of what existing
implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
test with three major browsers that use UTF-16 internally and have
independent (of each other) implementations of UTF-8 decoding
(Firefox, Edge and Chrome) shows agreement on the current spec: there
is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
6 on the second, 4 on the third and 6 on the last line). Changing the
Unicode standard away from that kind of interop needs *way* better
rationale than "feels right".

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/