Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 02:50:27 CDT 2017

On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <
unicode at unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
> <unicode at unicode.org> wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraints permit)
> that, *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to
> motivate why the Consortium should consider "UTF-8 internally" equally
> despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a
> case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)

Something I've learned through working with Node (which uses the V8
JavaScript engine from Chrome): V8 stores strings as either UTF-16 or UTF-8,
interchangeably, and is not strictly one or the other...

https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

And I wouldn't really assume UTF-16 is a 'majority'; Go is UTF-8, for
instance.

> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
>
>
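As a rough illustration of the "one REPLACEMENT CHARACTER per bogus byte"
counting described above, here is a minimal sketch in Python. It only shows
what Python's built-in decoder does; it is not a claim about how Firefox,
Edge, Chrome or ICU are implemented. The byte strings are made-up examples
(not the lines from the test page), and count_replacements is a hypothetical
helper name.

```python
# Count the U+FFFD characters that Python's built-in UTF-8 decoder emits
# for ill-formed input when asked to replace errors instead of raising.

REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with errors='replace' and count U+FFFD in the result."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Hypothetical inputs in which every byte is "bogus": overlong forms,
# an encoded surrogate, and a lead byte that UTF-8 never uses.
samples = [
    b"\xc0\x80",              # overlong 2-byte form of U+0000
    b"\xe0\x80\x80",          # overlong 3-byte form of U+0000
    b"\xed\xa0\x80",          # encoded surrogate U+D800
    b"\xf8\x80\x80\x80\x80",  # obsolete 5-byte form
]

if __name__ == "__main__":
    for raw in samples:
        print(raw.hex(" "), "->", count_replacements(raw), "x U+FFFD")
```

For these particular inputs the count equals the number of bytes, which is
the behaviour the quoted test describes. Under the proposal being discussed,
a sequence such as E0 80 80 could instead collapse into a single U+FFFD.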