Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Thu Jun 29 13:32:51 CDT 2017

On Sat Jun 3 23:09:01 CDT 2017 Markus Scherer wrote:
> I suggest you submit a write-up via http://www.unicode.org/reporting.html
>
> and make the case there that you think the UTC should retract
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19

The submission has been made:
http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf

> Also, since Chromium/Blink/v8 are using ICU, I suggest you submit an ICU
> ticket via http://bugs.icu-project.org/trac/newticket

Although they use ICU for most legacy encodings, they don't use ICU
for UTF-8. Hence, the difference between Chrome and ICU in the above
write-up.

> and make the case there, too, that you think (assuming you do) that ICU
> should change its handling of illegal UTF-8 sequences.

Whether I think ICU should change isn't quite that simple.

On one hand, a key worry that I have about Unicode changing the
long-standing guidance for UTF-8 error handling is that inducing
implementations to change (either by the developers feeling that they
have to implement the "best practice" or by others complaining when
"best practice" isn't implemented) is wasteful and a potential source
of bugs. In that sense, I feel I shouldn't ask ICU to change, either.

On the other hand, I care about implementations of the WHATWG Encoding
Standard being compliant and it appears that Node.js is on track to
expose ICU's UTF-8 decoder via the WHATWG TextDecoder API:
https://github.com/nodejs/node/pull/13644 . Additionally, this episode
of ICU behavior getting cited in a proposal to change the guidance in
the Unicode Standard is a reason why I'd be happier if ICU followed
the Unicode 10-and-earlier / WHATWG behavior, since there wouldn't be
the risk of ICU's behavior getting cited as a different reference as
happened with the proposal to change the guidance for Unicode 11.

Still, since I'm not affiliated with the Node.js implementation, I'm a
bit worried that if I filed an ICU bug on Node's behalf, I'd be
engaging in the kind of behavior towards ICU that I don't want to see
towards other implementations, including the one I've written, in
response to the new pending Unicode 11 guidance (which I'm requesting
be retracted). So at this time I haven't filed an ICU bug on Node's
behalf. Instead, I mentioned the difference between ICU and the
WHATWG spec when my input on testing the Node TextDecoder
implementation was sought
(https://github.com/nodejs/node/issues/13646#issuecomment-308084459).

>> But the matter at hand is decoding potentially-invalid UTF-8 input
>> into a valid in-memory Unicode representation, so later processing is
>> somewhat a red herring as being out of scope for this step. I do agree
>> that if you already know that the data is valid UTF-8, it makes sense
>> to work from the bit pattern definition only.
>
> No, it's not a red herring. Not every piece of software has a neat "inside"
> with all valid text, and with a controllable surface to the "outside".

Fair enough. However, I don't think this supports adopting the ICU
behavior as "best practice" when looking at a prominent real-world
example of such a system.

The Go programming language is an example of a system that post-dates
UTF-8, is even designed by the same people as UTF-8 and where strings
in memory are potentially-invalid UTF-8, i.e. there isn't a clear
distinction with UTF-8 on the outside and UTF-8 on the inside. (In
contrast to e.g. Rust where the type system maintains a clear
distinction between byte buffers and strings, and strings are
guaranteed-valid UTF-8.)

Go bakes UTF-8 error handling in the language spec by specifying
per-code point iteration over potentially-invalid in-memory UTF-8
buffers. See item 2 in the list at
https://golang.org/ref/spec#For_range .
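
To illustrate the range behavior described there, here is a minimal
sketch. The helper name rangeSteps and the sample input are mine and
are only for illustration; the loop itself is the plain for-range
over a string from the spec.

```go
package main

import "fmt"

// rangeSteps is a hypothetical helper for illustration: it records what a
// for-range loop yields on each iteration over s (byte index and rune).
func rangeSteps(s string) (starts []int, runes []rune) {
	for i, r := range s {
		starts = append(starts, i)
		runes = append(runes, r)
	}
	return starts, runes
}

func main() {
	// 'a', then a truncated three-byte sequence (E2 82 without the
	// final AC), then 'b'.
	starts, runes := rangeSteps("a\xE2\x82b")
	for k := range starts {
		fmt.Printf("byte %d: %U\n", starts[k], runes[k])
	}
}
```

For this input the loop yields U+FFFD at byte index 1 and again at
byte index 2, i.e. it advances a single byte after each invalid
sequence.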

The behavior baked into the language is one REPLACEMENT CHARACTER per
bogus byte, which is neither the Unicode 10-and-earlier "best
practice" nor the ICU behavior. However, it is closer to the Unicode
10-and-earlier "best practice" than to the ICU behavior. (It differs
from the Unicode 10-and-earlier behavior only for truncated sequences
that form a prefix of a valid sequence.)
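
To make the difference concrete, here is a rough sketch that counts
U+FFFD under both policies. perByteCount uses Go's unicode/utf8
package; maximalSubpartCount is my own illustrative state machine
modeled on the lower/upper boundary approach of the WHATWG Encoding
Standard's UTF-8 decoder, which I'm assuming here as a stand-in for
the Unicode 10-and-earlier behavior. Both function names are
hypothetical and this is not code from any of the implementations
discussed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// perByteCount counts errors the way Go's unicode/utf8 reports them:
// every invalid byte is its own error of width 1.
func perByteCount(b []byte) int {
	n := 0
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			n++
		}
		b = b[size:]
	}
	return n
}

// maximalSubpartCount counts one error per maximal prefix of a valid
// sequence (illustrative sketch, assumed to match the WHATWG decoder).
func maximalSubpartCount(b []byte) int {
	count, needed := 0, 0
	lower, upper := byte(0x80), byte(0xBF)
	for i := 0; i < len(b); {
		c := b[i]
		if needed > 0 {
			if c < lower || c > upper {
				// The prefix seen so far is one error; reprocess c as a lead byte.
				count++
				needed = 0
				lower, upper = 0x80, 0xBF
				continue
			}
			lower, upper = 0x80, 0xBF
			needed--
			i++
			continue
		}
		switch {
		case c <= 0x7F:
			// ASCII: nothing to do.
		case c >= 0xC2 && c <= 0xDF:
			needed = 1
		case c >= 0xE0 && c <= 0xEF:
			needed = 2
			if c == 0xE0 {
				lower = 0xA0 // exclude overlong forms
			} else if c == 0xED {
				upper = 0x9F // exclude surrogates
			}
		case c >= 0xF0 && c <= 0xF4:
			needed = 3
			if c == 0xF0 {
				lower = 0x90 // exclude overlong forms
			} else if c == 0xF4 {
				upper = 0x8F // exclude values above U+10FFFF
			}
		default:
			count++ // byte that can never start a sequence
		}
		i++
	}
	if needed > 0 {
		count++ // input ended in the middle of a sequence
	}
	return count
}

func main() {
	for _, in := range []string{"\xE2\x82", "\xED\xA0\x80", "\xF0\x9F\x98A"} {
		b := []byte(in)
		fmt.Printf("% X: per-byte %d, maximal-subpart %d\n",
			b, perByteCount(b), maximalSubpartCount(b))
	}
}
```

In this sketch the two counts differ for the truncated inputs E2 82
(2 versus 1) and F0 9F 98 41 (3 versus 1), but agree for the
surrogate-range input ED A0 80 (3 in both cases), since ED A0 is not
a prefix of any valid sequence.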

(To be clear, I'm not saying that the guidance in the Unicode Standard
should be changed to match Go, either. I'm just saying that Go is an
example of a prominent system with ambiguous inside and outside for
UTF-8 and it exhibits behavior closer to Unicode 10 than to ICU and,
therefore, is not a data point in favor of adopting the ICU behavior.)

-- 
Henri Sivonen
hsivonen at mozilla.com