Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Asmus Freytag via Unicode unicode at
Mon May 15 12:54:23 CDT 2017

On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote:
> Second, the political reason:
> Now that ICU is a Unicode Consortium project, I think the Unicode
> Consortium should be particularly sensitive to biases arising from being
> both the source of the spec and the source of a popular
> implementation. It looks *really bad* both in terms of equal footing
> of ICU vs. other implementations for the purpose of how the standard
> is developed as well as the reliability of the standard text vs. ICU
> source code as the source of truth that other implementors need to pay
> attention to if the way the Unicode Consortium resolves a discrepancy
> between ICU behavior and a well-known spec provision (this isn't some
> ill-known corner case, after all) is by changing the spec instead of
> changing ICU *especially* when the change is not neutral for
> implementations that have made different but completely valid per
> then-existing spec and, in the absence of legacy constraints, superior
> architectural choices compared to ICU (i.e. UTF-8 internally instead
> of UTF-16 internally).
> I can see the irony of this viewpoint coming from a WHATWG-aligned
> browser developer, but I note that even browsers that use ICU for
> legacy encodings don't use ICU for UTF-8, so the ICU UTF-8 behavior
> isn't, in fact, the dominant browser UTF-8 behavior. That is, even
> Blink and WebKit use their own non-ICU UTF-8 decoder. The Web is the
> environment that's the most sensitive to how issues like this are
> handled, so it would be appropriate for the proposal to survey current
> browser behavior instead of just saying that ICU "feels right" or is
> "natural".

I think this political reason should be taken very seriously. There are 
already too many instances where ICU can be seen "driving" the 
development of properties and algorithms.

Those involved in the ICU project may not see the problem, but I agree 
with Henri that it requires a bit more sensitivity from the UTC.

