Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Thu Jun 1 03:11:12 CDT 2017

On 31 May 2017, at 20:24, Shawn Steele via Unicode <unicode at unicode.org> wrote:
> 
> > For implementations that emit FFFD while handling text conversion and repair (ie, converting ill-formed
> > UTF-8 to well-formed), it is best for interoperability if they get the same results, so that indices within the
> > resulting strings are consistent across implementations for all the correct characters thereafter.
>  
> That seems optimistic :) 
>  
> If interoperability is the goal, then it would seem to me that changing the recommendation would be contrary to that goal.  There are systems that will not or cannot change to a new recommendation.  If such systems are updated, then adoption of those systems will likely take some time.

Indeed, if interoperability is the goal, the behaviour should be fully specified, not merely recommended.  At present, though, it appears that we have (broadly) two different behaviours in the wild, and nobody wants to change what they presently do.
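
To make the difference concrete, here is a rough sketch of the two behaviours as I understand them. It is an illustration under stated assumptions, not a description of any particular implementation. The function names (decode_maximal_subpart, decode_structural, expected_length) are made up for this example. The first simply calls Python's built-in decoder with errors="replace", which in recent CPython versions replaces each maximal ill-formed subpart. The second is a simplified stand-in for the "one U+FFFD per sequence implied by the lead byte" approach.

```python
def decode_maximal_subpart(data: bytes) -> str:
    # Delegates to the built-in decoder; on recent CPython this emits one
    # U+FFFD per maximal ill-formed subpart.
    return data.decode("utf-8", errors="replace")


def expected_length(lead: int) -> int:
    # Sequence length implied by the lead byte's bit pattern alone
    # (hypothetical helper; no validity checks beyond the pattern).
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1  # stray continuation byte or invalid lead byte


def decode_structural(data: bytes) -> str:
    # Simplified sketch: take the lead byte plus as many continuation bytes
    # (0x80..0xBF) as its pattern calls for, and emit a single U+FFFD if
    # that chunk is not well-formed UTF-8.
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict decode of one chunk
        except UnicodeDecodeError:
            out.append("\ufffd")
        i = j
    return "".join(out)


# Ill-formed three-byte pattern (E0 cannot be followed by 80), then "A".
sample = b"\xe0\x80\x80A"
a = decode_maximal_subpart(sample)
b = decode_structural(sample)
print(len(a), a.index("A"))
print(len(b), b.index("A"))
```

With CPython 3.3 or later I would expect the first line to print "4 3" and the second "2 1": the same input is repaired to strings of different lengths, so the index of the "A" that follows the bad bytes depends on which behaviour the decoder implements.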

Personally I agree with Shawn on this; the presence of a U+FFFD indicates that the input was invalid somehow.  You don’t know *how* it was invalid, and probably shouldn’t rely on equivalence with another invalid string.

There are obviously some exceptions - e.g. it *may* be desirable in the context of browsers to specify the behaviour in order to avoid behavioural differences being used for JavaScript-based “fingerprinting”.  But I don’t see why WHATWG (for instance) couldn’t specify that itself.

Kind regards,

Alastair.

--
http://alastairs-place.net