Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 02:23:14 CDT 2017

On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen <hsivonen at hsivonen.fi> wrote:
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome) shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".

Tested with that file, Python 3 and OpenJDK 8 also agree with the
currently specced best practice. I expect other well-known
implementations to comply with it as well, so the rationale for
changing the stated best practice would have to be very strong (for
example, a security problem with the currently stated best practice)
for a change to be appropriate.
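
For anyone who wants to repeat this kind of check, here is a minimal
sketch in Python 3. It assumes the built-in "replace" error handler of
bytes.decode() is what is being tested. The helper name
count_replacements and the sample byte sequences are hypothetical
illustrations; they are not taken from the test file above.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with the 'replace' handler and count U+FFFD."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Hypothetical samples of ill-formed UTF-8 (not the contents of the test file).
SAMPLES = {
    "overlong 2-byte form (C0 80)": b"\xc0\x80",
    "overlong 3-byte form (E0 80 80)": b"\xe0\x80\x80",
    "truncated 3-byte sequence before 'A' (E2 82 41)": b"\xe2\x82A",
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        print(f"{label}: {count_replacements(raw)} U+FFFD")
```

In the first two samples no byte can start or continue a well-formed
sequence, so each byte is replaced separately (2 and 3 replacement
characters). In the third, E2 82 is a valid prefix of a three-byte
sequence, so the truncated pair yields a single U+FFFD.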

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/