Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Hans Åberg via Unicode
unicode at unicode.org
Wed May 17 16:05:47 CDT 2017
> On 17 May 2017, at 22:36, Doug Ewell via Unicode <unicode at unicode.org> wrote:
> Hans Åberg wrote:
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored.
> I have always argued strongly against this idea, and always will.
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.
Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. No Unicode extensions are needed, since the standard has no say over what happens to data it considers invalid.
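As an illustration only, here is a minimal sketch of one such mapping in Python. It relies on Python's built-in "surrogateescape" error handler (PEP 383), which turns each undecodable byte 0xNN into the lone surrogate U+DCNN, a value that is not valid in well-formed UTF-32. The helper names decode_lossless and encode_lossless are hypothetical, and this is just one possible scheme, not necessarily the one a given filesystem or OS would use.

```python
from typing import List


def decode_lossless(raw: bytes) -> List[int]:
    """Decode UTF-8 into a list of 32-bit code units, keeping bad bytes.

    Valid sequences become ordinary code points. Each byte that is not
    part of a valid sequence becomes a lone surrogate U+DC80..U+DCFF,
    which is not a valid UTF-32 code unit.
    """
    text = raw.decode("utf-8", errors="surrogateescape")
    return [ord(ch) for ch in text]


def encode_lossless(units: List[int]) -> bytes:
    """Inverse of decode_lossless: restore the original octet sequence."""
    text = "".join(chr(u) for u in units)
    return text.encode("utf-8", errors="surrogateescape")


# A name that mixes valid UTF-8 ("café") with two invalid bytes.
raw = b"caf\xc3\xa9-\xff\xfe.txt"
units = decode_lossless(raw)

# The invalid bytes are carried as values outside valid UTF-32.
invalid = [u for u in units if 0xDC80 <= u <= 0xDCFF]

# The original octets come back unchanged.
restored = encode_lossless(units)
```

Because the escaped values can never result from decoding well-formed UTF-8, the mapping back to octets stays unambiguous.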
> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like.
The latter is complicated, so I am told that, with some exceptions, it is not what one does. Also, one may end up with a file in an unknown encoding, say one imported remotely, and then the OS cannot deal with it.