Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Martin J. Dürst via Unicode unicode at unicode.org
Fri May 26 05:28:36 CDT 2017


On 2017/05/25 09:22, Markus Scherer wrote:
> On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <public at khwilliamson.com>
> wrote:
>
>> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>>
>>> That's wrong. There was a public review issue with various options and
>>> with feedback, and the recommendation has been implemented and in wide
>>> use (among others, in major programming languages and browsers) without
>>> problems for quite some time.
>>>
>>
>> Could you supply a reference to the PRI and its feedback?
>>
>
> http://www.unicode.org/review/resolved-pri-100.html#pri121
>
> The PRI did not discuss possible different versions of "maximal subpart",
> and the examples there yield the same results either way. (No non-shortest
> forms.)

It is correct that it didn't give any of the *examples* that are under 
discussion now. On the other hand, the PRI is very clear about what it 
means by "maximal subpart":

Citing directly from the PRI:

 >>>>
The term "maximal subpart of the ill-formed subsequence" refers to the 
longest potentially valid initial subsequence or, if none, then to the 
next single code unit.
 >>>>

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier: RFC 3629 
(https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect 
the tightening of the UTF-8 definition in Unicode/ISO 10646.

>> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
>> ill-formed subsequence by a single U+FFFD."
>>
>
> You are right.
>
> http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
> expanded example compared with the PRI.
>
> The text simply talked about a "conversion process" stopping as soon as it
> encounters something that does not fit, so these edge cases would depend on
> whether the conversion process treats original-UTF-8 sequences as single
> units.

No, the text, both in the PRI and in Unicode 5.2, is quite clear. The 
"does not fit" (a phrase I haven't found in either text) is clearly 
grounded by "ill-formed UTF-8". And there's no question about what 
"ill-formed UTF-8" means, in particular in Unicode 5.2, where you just 
have to go two pages back to find byte sequences such as <C0 AF> and 
<E0 9F 80> called out explicitly as ill-formed.
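
This is easy to check against any decoder that implements the 
recommendation. Python 3's built-in decoder is one such implementation, 
so, purely as an illustration (nothing here is mandated by the PRI):

    # Python 3's UTF-8 decoder follows the recommendation: each maximal
    # subpart of an ill-formed subsequence becomes one U+FFFD.
    assert b'\xc0\xaf'.decode('utf-8', 'replace') == '\ufffd' * 2
    assert b'\xe0\x9f\x80'.decode('utf-8', 'replace') == '\ufffd' * 3

C0 can never begin a well-formed sequence, so each byte of <C0 AF> is 
its own maximal subpart; E0 is potentially valid only if followed by a 
byte in A0..BF, so <E0 9F 80> likewise yields three U+FFFDs.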

Any claim, as in the L2/17-168 document, that there is an option 2a 
is just not substantiated. It's true that there are no explicit 
examples in the PRI that would allow one to distinguish between 
converting e.g.
FC BF BF BF BF 80
to a single U+FFFD or to six of them. But there's no need to have 
examples for every corner case if the text is clear enough. In the above 
six-byte sequence, there's not a single potentially valid (initial) 
subsequence, so it's all single code units: six maximal subparts of one 
byte each, and therefore six U+FFFDs.
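
(The same illustrative check as above, again with Python 3's decoder, 
which follows the recommendation:

    assert (b'\xfc\xbf\xbf\xbf\xbf\x80'.decode('utf-8', 'replace')
            == '\ufffd' * 6)  # six maximal subparts of one byte each

FC was a lead byte only under the old 5-/6-byte definition; under the 
tightened definition, no byte in this sequence can start a well-formed 
sequence.)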


>> And I agree with that.  And I view an overlong sequence as a maximal
>> ill-formed subsequence

Can you point to any definition that would include or allow such an 
interpretation? I just haven't found any yet, either in the PRI or in 
Unicode 5.2.

>> that should be replaced by a single FFFD. There's
>> nothing in the text of 5.2 that immediately follows that recommendation
>> that indicates to me that my view is incorrect.

I have to agree that the text in Unicode 5.2 could be clearer. It's a 
hodgepodge of attempts at justifications and definitions. And the word 
"maximal" itself may also contribute to pushing the interpretation in 
one direction.

But there's plenty in the text that makes it absolutely clear that some 
things cannot be included. In particular, it says

 >>>>
The term “maximal subpart of an ill-formed subsequence” refers to the 
code units that were collected in this manner. They could be the start 
of a well-formed sequence, except that the sequence lacks the proper 
continuation. Alternatively, the converter may have found a 
continuation code unit, which cannot be the start of a well-formed sequence.
 >>>>

And the "in this manner" refers to:
 >>>>
A sequence of code units will be processed up to the point where the 
sequence either can be unambiguously interpreted as a particular Unicode 
code point or where the converter recognizes that the code units 
collected so far constitute an ill-formed subsequence.
 >>>>

So we have the same thing twice: Bail out as soon as something is 
ill-formed.
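
To make the two quoted definitions concrete, here is a minimal sketch 
of such a converter. This is purely illustrative, not from the standard 
or the PRI; the function and variable names are mine, and the 
second-byte constraints are the ones from Table 3-7:

    # Second-byte range per Table 3-7, keyed by lead byte; third and
    # fourth bytes are always 80..BF.
    SECOND_BYTE = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
                   0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}

    def seq_length(lead):
        # Length implied by a lead byte; 0 if the byte can never start
        # a well-formed sequence (80..BF, C0, C1, F5..FF).
        if lead <= 0x7F: return 1
        if 0xC2 <= lead <= 0xDF: return 2
        if 0xE0 <= lead <= 0xEF: return 3
        if 0xF0 <= lead <= 0xF4: return 4
        return 0

    def decode_with_replacement(data):
        out, i = [], 0
        while i < len(data):
            lead = data[i]
            n = seq_length(lead)
            if n <= 1:
                # ASCII, or a byte that is ill-formed on its own: the
                # "next single code unit" case from the PRI wording.
                out.append(chr(lead) if n else '\ufffd')
                i += 1
                continue
            cp = lead & (0x7F >> n)   # payload bits of the lead byte
            j = i + 1
            while j < i + n and j < len(data):
                lo, hi = (SECOND_BYTE.get(lead, (0x80, 0xBF))
                          if j == i + 1 else (0x80, 0xBF))
                if not lo <= data[j] <= hi:
                    break             # no longer potentially valid: bail out
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
            if j == i + n:
                out.append(chr(cp))   # unambiguously one code point
            else:
                out.append('\ufffd')  # maximal subpart was data[i:j]
            i = j
        return ''.join(out)

On the sequences discussed above, this yields two U+FFFDs for <C0 AF>, 
three for <E0 9F 80>, and six for <FC BF BF BF BF 80>, in line with the 
recommendation.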


>> Perhaps my view is colored by the fact that I now maintain code that was
>> written to parse UTF-8 back when overlongs were still considered legal
>> input.

Thanks for providing this information. That's a lot more useful than 
"feels right", which was given as a reason on this list before.


>> An overlong was a single unit.  When they became illegal, the code
>> still considered them a single unit.

That's fine for your code. I might do the same (or not) if I were you, 
because one indeed never knows in which situations some code is used, 
and what repercussions a change might have.

But the PRI, and the wording in Unicode 5.2, were created when 
overlongs, 5-byte and 6-byte sequences, surrogate code points, and so 
on were already very clearly ill-formed. If these texts had intended to 
make an exception for any of these cases, that exception would clearly 
have had to be written into the actual text.

Saying something like "the text isn't clear because it says ill-formed, 
but maybe it means ill-formed not as of the time it was written, but as 
of quite a few years before" just doesn't make sense to me at all.


>> I can understand how someone who comes along later could say C0 can't be
>> followed by any continuation character that doesn't yield an overlong,
>> therefore C0 is a maximal subsequence.

Yes indeed, because maximal subsequences are defined by reference to 
well-formed/ill-formed subsequences, and what's ill-formed is defined in 
the same standard at the same time.

There's nobody "coming along later". That kind of wording would be 
appropriate if the PRI and the recommendation in the standard had been 
created e.g. in the 1990s, before the tightening of the UTF-8 
definition. Then somebody could say that Unicode overlooked the fact 
that it implicitly changed the recommendation for how to generate 
U+FFFDs when it changed the definition of well-formed UTF-8.

But no such thing at all happened. The PRI was evaluated, and the 
recommendation included in the text of Unicode, in the context of the 
then-existing (and since then unchanged) definition of UTF-8.


>> But I assert that my interpretation is just as valid as that one.

Sorry, but it cannot be valid, because of the timing. The tightening of 
the UTF-8 definition happened years before the PRI.


>> And perhaps more so, because of historical precedent.

It's good to know that there are older implementations that behave 
differently. And I understand that in some cases, their maintainers 
might be reluctant to change them. My comments, and Henri's, are very 
much motivated by the fact that we are likewise reluctant to change our 
implementations.

It may be worth thinking about whether the Unicode standard should 
mention implementations like yours. But there should be no doubt about 
the fact that the PRI and Unicode 5.2 (and the current version of 
Unicode) are clear about what they recommend, and that that 
recommendation is based on the definition of UTF-8 in force at that 
time (and unchanged since), not on a historical definition of UTF-8.

Regards,   Martin.

