Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed May 31 14:04:41 CDT 2017


> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.

I think Richard stated the most compelling reason:

… The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is inviting trouble.


For implementations that emit FFFD while handling text conversion and
repair (i.e., converting ill-formed UTF-8 to well-formed), it is best for
interoperability if they get the same results, so that indices within the
resulting strings are consistent across implementations for all the
*correct* characters thereafter.

It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:
fixed = fix(source)
fixed == ��abc
codepointAt(fixed, 3) == 'b'

Vendor 2:
fixed = fix(source)
fixed == �abc
codepointAt(fixed, 3) == 'c'
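
The divergence is easy to reproduce against a real decoder. A minimal
sketch in Python: CPython's errors='replace' handler happens to play the
role of Vendor 1 here, and Vendor 2's single-FFFD behavior is only
described in a comment, not implemented:

source = b"\xc0\x80abc"

# CPython emits one U+FFFD per bogus byte here: 0xC0 is an invalid
# lead byte and 0x80 is a stray continuation byte (Vendor 1's behavior).
fixed = source.decode("utf-8", errors="replace")
print(fixed)       # '\ufffd\ufffdabc'
print(fixed[3])    # 'b'

# A decoder that collapsed the whole ill-formed run into a single
# U+FFFD (Vendor 2's behavior; hypothetical here) would instead give
# '\ufffdabc', where index 3 is 'c': every later index shifts by one.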

In theory one could just throw an exception. In practice, nobody wants
their browser to go belly up on a webpage with a component that has an
ill-formed bit of UTF-8.

In theory one could document everyone's flavor of the month for how many
FFFDs to emit. In practice, that falls apart immediately, since in today's
interconnected world you can't tell which processes get first crack at text
repair.

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode at unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store.  For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units.  (Deletion by
> storage units is not available in this
> > interface.)  This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 bytes in
> the middle of ASCII text.  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
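
For what it's worth, here is how one widely deployed decoder (CPython's
UTF-8 codec, via errors='replace') happens to answer both questions; a
data point, not a ruling on best practice:

# Six stray 0x80 continuation bytes in the middle of ASCII:
print((b"abc" + b"\x80" * 6 + b"def").decode("utf-8", errors="replace"))
# -> 'abc\ufffd\ufffd\ufffd\ufffd\ufffd\ufffddef'  (six U+FFFDs, one per byte)

# Two 2-byte-sequence lead bytes with no trail bytes:
print(b"\xc2\xc2".decode("utf-8", errors="replace"))
# -> '\ufffd\ufffd'  (one U+FFFD per truncated sequence)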
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences.  It would seem to me that none of these illegal sequences should
> appear in practice, so we have one of:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 coding errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all).  The other scenarios seem just as likely as (or, IMO, much more
> likely than) a badly designed encoder creating overlong sequences that
> appear to fit the UTF-8 pattern but aren't actually UTF-8.
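
To make the overlong case concrete: 0xC0 0x80 is the two-byte overlong
form of U+0000, which a conforming decoder must not interpret. A small
Python illustration (CPython shown; other strict decoders behave
similarly):

try:
    b"\xc0\x80".decode("utf-8")        # overlong encoding of U+0000
except UnicodeDecodeError as err:
    print(err)  # "... can't decode byte 0xc0 in position 0: invalid start byte"

# With repair instead of rejection, the bytes become replacement chars:
print(b"\xc0\x80".decode("utf-8", errors="replace"))  # '\ufffd\ufffd'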
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
>
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.
>
> -Shawn
>
>