Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 08:21:53 CDT 2017

On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode <unicode at unicode.org> wrote:

> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> > <unicode at unicode.org> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
> 
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original octet
> sequence can be restored.

Escaping the inappropriate bytes is the natural technique.
Your problem is smoothly transitioning so that the escape character is
always escaped when it means itself. Strictly, it can't be done.
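
To make the escaping idea concrete, here is a minimal sketch in Python.
The choice of U+001B as the escape character, the two-hex-digit payload,
and the names decode_escaped and encode_escaped are all assumptions made
for illustration; nothing here is a proposed standard.

```python
ESC = "\x1b"  # hypothetical choice of escape character


def decode_escaped(data: bytes) -> str:
    """Decode UTF-8, turning each ill-formed byte into ESC + two hex digits."""
    out = []
    i = 0
    while i < len(data):
        # Look for one well-formed character starting at i (1 to 4 bytes).
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            # A literal escape character is doubled so it stays unambiguous.
            out.append(ESC + ESC if ch == ESC else ch)
            i += n
            break
        else:
            # No well-formed character starts here: escape this single byte.
            out.append(f"{ESC}{data[i]:02X}")
            i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Invert decode_escaped, restoring the original octets."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != ESC:
            out += text[i].encode("utf-8")
            i += 1
        elif text[i + 1:i + 2] == ESC:
            out += ESC.encode("utf-8")  # doubled ESC means a literal ESC
            i += 2
        else:
            out.append(int(text[i + 1:i + 3], 16))  # ESC + hex is a raw byte
            i += 3
    return bytes(out)


raw = b"caf\xc3\xa9 \xff\x1b\xc3"
assert encode_escaped(decode_escaped(raw)) == raw
```

The round trip only works for strings that decode_escaped itself
produced. A string created before such a scheme was adopted, containing a
bare U+001B that means itself, would be misread by encode_escaped, which
is the transition problem.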

Of course, some sequences of escaped characters should be prohibited.
Checking could be fiddly.
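
As an illustration of why some sequences need prohibiting, Python's
existing "surrogateescape" error handler (PEP 383) maps each undecodable
byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF. A run of such
escapes that spells out well-formed UTF-8 cannot come from decoding, so
it does not survive a round trip. The helper name is_canonical below is
an assumption for illustration, and the check is only a sketch of one
way to detect such strings.

```python
def is_canonical(text: str) -> bool:
    """True if text is what decoding its own escaped bytes would produce."""
    try:
        raw = text.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        # Surrogates outside U+DC80..U+DCFF cannot be encoded at all.
        return False
    return raw.decode("utf-8", "surrogateescape") == text


# An escaped 0xFF after ordinary text round-trips unchanged.
assert is_canonical("caf\u00e9 \udcff")

# Escaped 0xC3 followed by escaped 0xA9 encodes to the bytes of U+00E9,
# which decode as that character rather than as two escapes.
assert not is_canonical("\udcc3\udca9")
```

The check costs a full encode and decode of the string, which is simple
to write but does not point at where the offending sequence is.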

Richard.