Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Tue May 16 08:00:33 CDT 2017

2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode <unicode at unicode.org>:

>
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode <unicode at unicode.org>
> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode codepoint
> markers that indicate how UTF-8, including non-valid sequences, is
> translated into UTF-32 in a way that the original octet sequence can be
> restored.

Why just UTF-32? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to
valid UTF-8/UTF-16/UTF-32?

In all cases this would require extensions to all 3 standards (which MUST
be interoperable). You would then choke on new validation rules for those
extensions in each of the 3 standards, and on new ill-formed sequences that
you won't be able to convert interoperably. Given that UTF-16 imposes the
most restrictive conditions (and is still the most widely used internal
representation), such extensions would be very complex to manage.

There's no solution: such extensions in any one of them are undesirable and
can only be used privately (without interoperating with the other 2
representations), so it's impossible to guarantee that the original octet
sequences can be restored once the text passes through another UTF.

Any deviation from UTF-8/16/32 stays confined to the UTF in which it is
made. It cannot be part of the 3 standard UTFs, but may be part of a
distinct encoding that is not fully compatible with the 3 standards.
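
As an illustration of such a private, non-interoperable extension, here is
a minimal sketch using Python's "surrogateescape" error handler (PEP 383),
which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF. The
helper names (decode_private, encode_private, is_interchangeable) and the
sample bytes are my own assumptions for the example, not anything proposed
in this thread. The octets round-trip only through the same private
convention; the strict UTF-8/16/32 encoders all reject the result.

```python
# Sketch only: helper names and sample data are hypothetical.
# "surrogateescape" is Python's PEP 383 error handler.

RAW = b"caf\xc3\xa9-\xff\xfe"  # valid UTF-8 "café-" followed by two invalid bytes


def decode_private(data: bytes) -> str:
    """Decode UTF-8, smuggling each invalid byte as a lone low surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_private(text: str) -> bytes:
    """Inverse of decode_private: turn the smuggled surrogates back into bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def is_interchangeable(text: str) -> bool:
    """True if the strict UTF-8, UTF-16 and UTF-32 encoders all accept text."""
    for codec in ("utf-8", "utf-16", "utf-32"):
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            return False
    return True


text = decode_private(RAW)

# The original octets are restored, but only via the same private convention.
restored = encode_private(text)

# The intermediate string is not valid for interchange in any standard UTF.
portable = is_interchangeable(text)
```

Here restored equals RAW, while portable is False: the escape works inside
one implementation but the escaped string cannot be handed to a conformant
UTF-8, UTF-16 or UTF-32 consumer.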