Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Doug Ewell via Unicode unicode at unicode.org
Wed May 17 15:36:08 CDT 2017


Hans Åberg wrote:

> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original
> octet sequence can be restored. 

I have always argued strongly against this idea, and always will.

Far from solving the stated problem, it would introduce a new one:
conversion from the "bad data" Unicode code points, currently
well-defined, would become ambiguous.

Suppose the block U+EFFxx were assigned to invalid UTF-8 bytes <xx>.
Then there would be two possible conversions from, for instance,
U+EFF80: either <80> or <F3 AF BE 80>.
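
As a rough illustration of that ambiguity, here is a minimal Python
sketch. It assumes the hypothetical mapping above (invalid byte <xx>
becomes U+EFFxx); the handler name "effxx" and the constant ESCAPE_BASE
are invented for this example and are not part of any standard or
existing library.

```python
import codecs

# Hypothetical block from the example above: invalid byte <xx> -> U+EFFxx.
ESCAPE_BASE = 0xEFF00

def _escape_invalid(exc):
    # Map each undecodable byte to a code point in the hypothetical block.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), exc.end

# "effxx" is an illustrative handler name, registered only for this sketch.
codecs.register_error("effxx", _escape_invalid)

# An invalid lone continuation byte, escaped into the special block.
from_invalid = b"\x80".decode("utf-8", errors="effxx")

# The ordinary, well-formed UTF-8 encoding of U+EFF80.
from_valid = b"\xf3\xaf\xbe\x80".decode("utf-8", errors="effxx")

# Both inputs decode to the same one-character string, so the original
# octets can no longer be recovered from the decoded text alone.
same = from_invalid == from_valid
```

Two different byte sequences land on the same code point, so a
converter going back to bytes has to guess which one was meant.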

Declaring the "special" code points to be excluded from straightforward
UTF-* conversion would invalidate every existing UTF-* processor, and
such a declaration would be widely ignored.

File systems cannot have it both ways: they must define file names
either as unrestricted sequences of bytes, or as strings of characters
in some defined encoding. If they choose the latter, they need to define
conversion mechanisms with suitable fallback and adhere to them. They
can use the PUA if they like. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



