Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Henri Sivonen via Unicode unicode at unicode.org
Mon May 15 05:21:45 CDT 2017


In reference to:
http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf

I think Unicode should not adopt the proposed change.

The proposal is to make ICU's spec violation conforming. I think there
is both a technical and a political reason why the proposal is a bad
idea.

First, the technical reason:

ICU uses UTF-16 as its in-memory Unicode representation, so ICU
isn't representative of the concerns of implementations that use
UTF-8 as their in-memory Unicode representation.

Even though there are notable systems (Win32, Java, C#, JavaScript,
ICU, etc.) that are stuck with UTF-16 as their in-memory
representation, which makes the concerns of such implementations very
relevant, I think the Unicode Consortium should acknowledge that
UTF-16 was, in retrospect, a mistake: Unicode grew past 16 bits
anyway, making UTF-16 both variable-width *and* ASCII-incompatible,
i.e. widening the code units to be ASCII-incompatible didn't buy a
constant-width encoding after all. When the legacy constraints of
Win32, Java, C#, JavaScript, ICU, etc. don't force UTF-16 as the
internal Unicode representation, using UTF-8 as the internal Unicode
representation is the technically superior design: it is
memory-efficient and cache-efficient when dealing with data formats
whose syntax is mostly ASCII (e.g. HTML), it forces developers to
handle variable-width issues right away, it makes input decode a
matter of mere validation without a copy when the input is
conforming, and it makes output encode infinitely fast (no encode
step is needed).
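
As a rough illustration of the "validation without copy" and "no
encode step" points, here's a minimal sketch using only Rust's
standard library (the function names are merely illustrative, not
encoding_rs's API):

  // When UTF-8 is the internal representation, "decoding" conforming
  // UTF-8 input is just validation over the existing bytes, and
  // "encoding" output is a no-op view of those bytes.
  fn decode_utf8(input: &[u8]) -> Result<&str, std::str::Utf8Error> {
      // Validates in place; no copy is made for conforming input.
      std::str::from_utf8(input)
  }

  fn encode_utf8(text: &str) -> &[u8] {
      // The internal representation already *is* UTF-8.
      text.as_bytes()
  }

  fn main() {
      let input = b"<p>Hello</p>";
      let text = decode_utf8(input).expect("valid UTF-8");
      assert_eq!(encode_utf8(text), input);
  }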

Therefore, despite UTF-16 being widely used as an in-memory
representation of Unicode and in no way going away, I think the
Unicode Consortium should be *very* sympathetic to technical
considerations for implementations that use UTF-8 as the in-memory
representation of Unicode.

When looking at this issue from the ICU perspective of using UTF-16
as the in-memory representation of Unicode, it's easy to consider
the proposed change the easier option for implementations (after
all, no change to the ICU implementation is involved!). However,
when UTF-8 is the in-memory representation of Unicode and "decoding"
UTF-8 input is a matter of *validating* UTF-8, a state machine that
rejects a sequence as soon as it's impossible for the sequence to be
valid UTF-8 (under the definition that excludes surrogate code
points and code points beyond U+10FFFF) makes a whole lot of sense.
If the proposed change were adopted, Draconian decoders (that fail
upon the first error) could retain their current state machine, but
implementations that emit U+FFFD for errors and continue would have
to add more state machine states (i.e. more complexity) to
consolidate more input bytes into a single U+FFFD even after a valid
sequence is obviously impossible.
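
To make the difference concrete, consider the ill-formed sequence
ED A0 80, which would encode the surrogate U+D800. Here's a minimal
sketch using Rust's standard library, whose from_utf8_lossy follows,
as far as I can tell, the current recommendation; the single-U+FFFD
result mentioned in the final comment is my reading of the proposal,
not the output of any implementation:

  fn main() {
      // ED A0 80 would encode the surrogate U+D800, so it is
      // ill-formed UTF-8.
      let ill_formed = [0xEDu8, 0xA0, 0x80];

      // Current recommendation: ED is rejected as soon as A0 is
      // seen, because no well-formed sequence starts with ED A0.
      // Each of the three bytes then becomes its own U+FFFD.
      let decoded = String::from_utf8_lossy(&ill_formed);
      assert_eq!(&*decoded, "\u{FFFD}\u{FFFD}\u{FFFD}");

      // Under the proposed change, the same three bytes would be
      // consolidated into a single U+FFFD, which a fail-fast
      // validator can't do without extra states that keep consuming
      // trail bytes after the error is already certain.
  }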

When the decision can easily go either way for implementations that
use UTF-16 internally but the options are not equal when using UTF-8
internally, the "UTF-8 internally" case should be decisive.
(Especially when, spec-wise, that decision involves no change. I
further note that the proposal PDF argues on the level of what
"feels right" without even discussing the impact on implementations
that use UTF-8 internally.)

As a matter of implementation experience, the implementation I've
written (https://github.com/hsivonen/encoding_rs) supports both
scenarios, UTF-16 as the in-memory Unicode representation and UTF-8
as the in-memory Unicode representation, and the fail-fast
requirement wasn't onerous even in the UTF-16 scenario.

Second, the political reason:

Now that ICU is a Unicode Consortium project, I think the Unicode
Consortium should be particularly sensitive to biases arising from
being both the source of the spec and the source of a popular
implementation. It looks *really bad*, both for the equal footing of
ICU vs. other implementations in how the standard is developed and
for the reliability of the standard text vs. ICU source code as the
source of truth that other implementors need to pay attention to, if
the way the Unicode Consortium resolves a discrepancy between ICU
behavior and a well-known spec provision (this isn't some
little-known corner case, after all) is by changing the spec instead
of changing ICU. That goes *especially* when the change is not
neutral for implementations that have made different but completely
valid (per the then-existing spec) and, in the absence of legacy
constraints, superior architectural choices compared to ICU (i.e.
UTF-8 internally instead of UTF-16 internally).

I can see the irony of this viewpoint coming from a WHATWG-aligned
browser developer, but I note that even browsers that use ICU for
legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
isn't, in fact, the dominant browser UTF-8 behavior. That is, even
Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
environment that's the most sensitive to how issues like this are
handled, so it would be appropriate for the proposal to survey current
browser behavior instead of just saying that ICU "feels right" or is
"natural".

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/

