Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 08:23:55 CDT 2017

> On 16 May 2017, at 15:00, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode <unicode at unicode.org>:
> 
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
> 
> It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored.
> 
> Why just UTF-32 ?

Synonym for codepoint numbers. It would suffice to add markers indicating how the input was translated. For example, codepoints meaning "overlong length <number>", "byte", or whatever is useful.

> How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ?

You don't. You have a filename, which is an octet sequence of unknown encoding, and want to deal with it. Therefore, even valid Unicode transformations of the filename may result in the file no longer being reachable.
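
A small illustration of that last point, as a sketch only: the filename below is made up, and NFC normalization stands in for "a valid Unicode transformation". The name is well-formed UTF-8, yet normalizing it changes the octets, so a filesystem that compares octets would no longer find the file under the transformed name.

```python
import unicodedata

# Hypothetical filename as stored on disk, in decomposed form:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
raw_name = b"cafe\xcc\x81.txt"

# Decode, then apply a perfectly valid Unicode transformation (NFC).
decoded = raw_name.decode("utf-8")
normalized = unicodedata.normalize("NFC", decoded)

# Canonically equivalent text, but a different octet sequence.
reencoded = normalized.encode("utf-8")
changed = reencoded != raw_name
```

Here `reencoded` uses the precomposed U+00E9, so it is one octet shorter than `raw_name`.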

It only matters that the correct octet sequence is handed back to the filesystem. All current filesystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.
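
For comparison, one existing mechanism of this kind is Python's "surrogateescape" error handler (PEP 383), which acts like a "byte" marker: each octet that cannot be decoded is mapped to a lone surrogate in U+DC80..U+DCFF and mapped back on encoding. This is not the codepoint scheme suggested above, only a sketch of the round-trip property; the helper names and the filename are made up for the example.

```python
def to_text(raw: bytes) -> str:
    """Decode octets as UTF-8, escaping each undecodable octet as a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_octets(text: str) -> bytes:
    """Encode back to octets, restoring the escaped octets exactly."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical filename containing octets that are not valid UTF-8.
raw_name = b"report-\xff\xc0\xaf.txt"

text = to_text(raw_name)    # usable as a string inside the program
restored = to_octets(text)  # identical to the original octet sequence
```

The string can be stored and passed around like any other, and the filesystem still receives the original octets, as long as nothing in between transforms the escaped code points.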