Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 14:43:58 CDT 2017

On Tue, 16 May 2017 11:36:39 -0700
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serialization, "modified UTF-8"), and for that it might be
> convenient too to treat overlong sequences as single errors.

I think that's not quite true.  If we are moving back and forth through
a buffer containing corrupt text, we need to make sure that moving three
characters forward and then three characters back leaves us where we
started.  That requires internal consistency: the forward and backward
scans must carve an ill-formed sequence into the same subsequences.
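
To make the round-trip point concrete, here is a rough sketch in
Python.  It is an illustration only, not any particular library's
code.  The forward step treats each maximal subpart of an ill-formed
sequence as one error unit, using the well-formed byte ranges of Table
3-7 in the Unicode Standard.  The names step_forward, step_back_naive,
step_back_consistent and move are all made up for this example, and
the 'naive' backward step (back up over trail bytes to the nearest
lead byte) is a hypothetical shortcut, chosen because it carves the
bytes differently from the forward step.

```python
def step_forward(buf, pos):
    """Return the offset just past the character or error unit at pos."""
    if pos >= len(buf):
        return pos
    b0 = buf[pos]
    if b0 < 0x80:
        return pos + 1                  # ASCII
    # Allowed range of the second byte, and the number of trail bytes.
    if 0xC2 <= b0 <= 0xDF:
        lo, hi, trail = 0x80, 0xBF, 1
    elif b0 == 0xE0:
        lo, hi, trail = 0xA0, 0xBF, 2
    elif b0 == 0xED:
        lo, hi, trail = 0x80, 0x9F, 2
    elif 0xE1 <= b0 <= 0xEF:
        lo, hi, trail = 0x80, 0xBF, 2
    elif b0 == 0xF0:
        lo, hi, trail = 0x90, 0xBF, 3
    elif 0xF1 <= b0 <= 0xF3:
        lo, hi, trail = 0x80, 0xBF, 3
    elif b0 == 0xF4:
        lo, hi, trail = 0x80, 0x8F, 3
    else:
        return pos + 1                  # stray trail byte or invalid lead
    end = pos + 1
    for _ in range(trail):
        if end >= len(buf) or not lo <= buf[end] <= hi:
            break                       # the maximal subpart ends here
        end += 1
        lo, hi = 0x80, 0xBF             # later trail bytes: full range
    return end


def step_back_naive(buf, pos):
    """Hypothetical shortcut: skip back over up to three trail bytes."""
    if pos == 0:
        return 0
    start = pos - 1
    while start > 0 and pos - start < 4 and 0x80 <= buf[start] <= 0xBF:
        start -= 1
    return start


def step_back_consistent(buf, pos):
    """Rescan forward from the buffer start; return the boundary before pos."""
    prev = cur = 0
    while cur < pos:
        prev, cur = cur, step_forward(buf, cur)
    return prev


def move(buf, pos, count, step):
    """Apply a stepping function count times."""
    for _ in range(count):
        pos = step(buf, pos)
    return pos


buf = b"ab\xe0\x80\x80c"                # E0 80 80 is an overlong form
there = move(buf, 1, 3, step_forward)                   # 4
naive = move(buf, there, 3, step_back_naive)            # 0, not 1
consistent = move(buf, there, 3, step_back_consistent)  # 1
```

Going forward, E0 80 80 counts as three error units; the naive
backward step swallows them as one, so three steps back overshoot the
starting point.  Rescanning from the start of the buffer is slow, but
it cannot disagree with the forward scan.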

One possible issue is with text input methods that access an
application's backing store.  They can issue updates in the form of
'delete 3 characters and insert ...'.  However, if the input method is
accessing characters it hasn't written, it's probably misbehaving
anyway.  Such commands rely heavily on the assumption that the input
method takes into account any relevant normalisation done by the
application.  I once had a go at fixing an application that was
misinterpreting 'delete x characters' as 'delete x UTF-16 code units'.
It was a horrible mess, as the application's interface layer couldn't
peek at the string being edited.
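
A small Python sketch of that kind of misinterpretation, assuming
that 'characters' in the input method's command means code points.
The helper names utf16_units and delete_before_cursor are made up for
the illustration; this is not the application's actual code.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def delete_before_cursor(seq, cursor, count):
    """Drop count items immediately before cursor (works on str or list)."""
    return seq[:cursor - count] + seq[cursor:]


text = "ab\U0001F600"        # 3 code points, but 4 UTF-16 code units
units = utf16_units(text)    # [0x61, 0x62, 0xD83D, 0xDE00]

# The input method asks for 'delete 1 character' with the cursor at the end.
as_characters = delete_before_cursor(text, len(text), 1)    # "ab"
as_code_units = delete_before_cursor(units, len(units), 1)  # lone 0xD83D left
```

Read as code units, the command removes only the low surrogate and
leaves an unpaired high surrogate in the backing store.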

Richard.