Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Martin J. Dürst via Unicode unicode at unicode.org
Tue May 16 06:08:52 CDT 2017


Hello everybody,

[using this mail to reply, in effect, to several mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:
> On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag <asmusf at ix.netcom.com> wrote:

>> Under what circumstance would it matter how many U+FFFDs you see?
>
> Maybe it doesn't, but I don't think the burden of proof should be on
> the person advocating keeping the spec and major implementations as
> they are. If anything, I think those arguing for a change of the spec
> in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
> with the current spec should show why it's important to have a
> different number of U+FFFDs than the spec's "best practice" calls for
> now.

I have just checked (the programming language) Ruby. Some background:

As you may know, Ruby is (at least in theory) pretty 
encoding-independent, meaning you can run scripts in ISO-8859-1, in 
Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, 
without any conversion.
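
As a minimal illustration (a hypothetical two-line script, not from any 
real code base), a magic comment is all it takes:

# encoding: iso-8859-1
# The magic comment above sets the source encoding, so the literal
# below is created as an ISO-8859-1 string; nothing is converted.
s = "caf\xE9"        # 0xE9 is e-acute in ISO-8859-1
puts __ENCODING__    #=> ISO-8859-1
puts s.encoding      #=> ISO-8859-1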

However, in practice (including Ruby on Rails), Ruby very much uses 
UTF-8 internally and is optimized to work well that way. Character 
encoding conversion also works with UTF-8 as the pivot encoding.
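
For example (a small sketch; the conversion below involves no Unicode 
encoding on either side, yet internally goes through UTF-8 as just 
described):

# Halfwidth katakana "ｱｲ" as Windows-31J (Shift_JIS) bytes.
sjis  = "\xB1\xB2".force_encoding("Windows-31J")
eucjp = sjis.encode("EUC-JP")    # pivots through UTF-8 internally
p eucjp.bytes.map { |b| format("%02X", b) }
#=> ["8E", "B1", "8E", "B2"]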

As far as I understand, Ruby does the same as all of the software 
mentioned above, based (among other things) on the fact that we followed 
the recommendation in the standard: one U+FFFD is emitted per maximal 
subpart of an ill-formed subsequence. Here are a few examples:

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=>    "\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: 
:replace).inspect'
#=>    "\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: 
:replace).inspect'
#=>    "\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: 
:replace).inspect
#=>    "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", 
invalid: :        replace).inspect'
#=>    "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html, as noted at
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup#l1507
(For those having a look at these tests: in Ruby's version of 
assert_equal, the expected value comes first. I'm not sure whether that 
is called little-endian or big-endian :-), but it is a decision on which 
the various test frameworks are split virtually 50/50 :-(.)
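
For illustration only, a test in the style of the ones linked above 
might look like this (a sketch, not copied from Ruby's test suite):

require 'test/unit'

class TestUTF8Replacement < Test::Unit::TestCase
  def test_maximal_subparts
    # Expected value first, per Ruby's assert_equal convention.
    assert_equal("\uFFFD\uFFFD\uFFFD".encode("UTF-16BE"),
                 "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace))
  end
end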

Even though the above examples and the tests use conversion to UTF-16 
(in particular the BE variant, for better readability), what happens 
internally is that the input is analyzed byte by byte. In that case, it 
is easiest to just stop as soon as something is found that is clearly 
invalid (be it a single byte or something longer). This keeps a 
data-driven implementation (such as the Ruby transcoder) or one based on 
a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) 
more compact.

In other words, because we never know whether the next byte is a valid 
one such as 0x41, it is easier to handle just one byte at a time if that 
lets us avoid lookahead (avoiding lookahead is generally a good idea 
when parsing).
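
To make this concrete, here is a byte-at-a-time decoder sketch written 
for this mail (an illustration, not Ruby's actual transcoder code). It 
decides at each byte, without backtracking: as soon as a byte cannot 
continue the current sequence, it emits one U+FFFD for the maximal 
subpart seen so far and resumes at that very byte. The lead-byte table 
is the one from Table 3-7 of the Unicode Standard:

def replace_ill_formed(bytes)
  out = +""
  i = 0
  while i < bytes.size
    b = bytes[i]
    if b <= 0x7F                        # ASCII passes through unchanged
      out << b.chr
      i += 1
      next
    end
    # Continuation count and allowed range for the second byte,
    # per Table 3-7 of the Unicode Standard.
    need, lo, hi = case b
                   when 0xC2..0xDF             then [1, 0x80, 0xBF]
                   when 0xE0                   then [2, 0xA0, 0xBF]
                   when 0xE1..0xEC, 0xEE..0xEF then [2, 0x80, 0xBF]
                   when 0xED                   then [2, 0x80, 0x9F]
                   when 0xF0                   then [3, 0x90, 0xBF]
                   when 0xF1..0xF3             then [3, 0x80, 0xBF]
                   when 0xF4                   then [3, 0x80, 0x8F]
                   else [0, 0, 0] # 0x80..0xC1 and 0xF5..0xFF never lead
                   end
    if need.zero?
      out << "\uFFFD"                   # lone invalid byte: one U+FFFD
      i += 1
      next
    end
    j = i + 1
    need.times do
      break unless j < bytes.size && bytes[j].between?(lo, hi)
      lo, hi = 0x80, 0xBF               # bytes 3 and 4 are plain 80..BF
      j += 1
    end
    if j == i + 1 + need                # all continuation bytes present
      out << bytes[i...j].pack("C*").force_encoding("UTF-8")
    else
      out << "\uFFFD"                   # one U+FFFD per maximal subpart
    end
    i = j                               # resume at the offending byte
  end
  out
end

replace_ill_formed("\x41\xc0\xaf\x41\xf4\x80\x80\x41".bytes)
#=> "A\uFFFD\uFFFDA\uFFFDA", the same as the last example above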

I agree with Henri and others that there is no need at all to change the 
recommendation in the standard that has been stable for so long (close 
to 9 years).

Because the original recommendation was adopted via a PR 
(http://www.unicode.org/review/pr-121.html), I think this change should 
at least also be handled as a PR (if it is not dropped based on the 
discussion here).

I think changing the current definition of "maximal subsequence" is a 
bad idea, because the same term would then have meant different things 
over the years, and one would no longer know what one was talking about. 
If necessary, new definitions should be introduced for the other 
variants.

I agree with others that ICU should not be considered to have a special 
status; it should be just one implementation among others.

[The next point is a side issue, please don't spend too much time on 
it.] I find it particularly strange that, at a time when UTF-8 is firmly 
defined as up to 4 bytes, never including any bytes above 0xF4, the 
Unicode Consortium would want to consider recommending that <FD 81 82 83 
84 85> be converted to a single U+FFFD. I note with agreement that 
Markus seems to have thoughts in the same direction, because the 
proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes 
above F4 could be somewhat debatable.)".


Regards,    Martin.

