Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Martin J. Dürst via Unicode unicode at unicode.org
Tue May 30 05:55:47 CDT 2017


Hello Markus, others,

On 2017/05/27 00:41, Markus Scherer wrote:
> On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp>
> wrote:
> 
>> But there's plenty in the text that makes it absolutely clear that some
>> things cannot be included. In particular, it says
>>
>>>>>>
>> The term “maximal subpart of an ill-formed subsequence” refers to the code
>> units that were collected in this manner. They could be the start of a
>> well-formed sequence, except that the sequence lacks the proper
>> continuation. Alternatively, the converter may have found a continuation
>> code unit, which cannot be the start of a well-formed sequence.
>>>>>>
>>
>> And the "in this manner" refers to:
>>>>>>
>> A sequence of code units will be processed up to the point where the
>> sequence either can be unambiguously interpreted as a particular Unicode
>> code point or where the converter recognizes that the code units collected
>> so far constitute an ill-formed subsequence.
>>>>>>
>>
>> So we have the same thing twice: Bail out as soon as something is
>> ill-formed.
> 
> 
> The UTF-8 conversion code that I wrote for ICU, and apparently the code
> that various other people have written, collects sequences starting from
> lead bytes, according to the original spec, and at the end looks at whether
> the assembled code point is too low for the lead byte, or is a surrogate,
> or is above 10FFFF. Stopping at a non-trail byte is quite natural,

I think nobody is debating that this is *one way* to do things, and that 
some code does it.
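
In code, that way of doing things looks roughly like the sketch below
(Python, purely illustrative; not ICU's actual implementation, and the
function name is made up):

    # Illustrative sketch of a lead-byte-driven converter (not ICU's actual
    # code): collect as many trail bytes as the lead byte announces, validate
    # the assembled code point only at the end, and emit a single U+FFFD for
    # the whole collected run.
    def decode_lead_byte_driven(data):
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                        # ASCII passes through
                out.append(chr(b)); i += 1; continue
            if 0xC0 <= b <= 0xDF:   need = 1    # lead byte announces 1 trail byte
            elif 0xE0 <= b <= 0xEF: need = 2
            elif 0xF0 <= b <= 0xF7: need = 3
            else:                               # lone trail byte, or F8..FF
                out.append('\uFFFD'); i += 1; continue
            cp, j = b & (0x7F >> (need + 1)), i + 1
            while j < len(data) and j - i <= need and 0x80 <= data[j] <= 0xBF:
                cp = (cp << 6) | (data[j] & 0x3F); j += 1
            # Only now: truncated? too low for the lead byte (over-long)?
            # a surrogate? above 10FFFF?
            if (j - i == need + 1
                    and cp >= (0x80, 0x800, 0x10000)[need - 1]
                    and not 0xD800 <= cp <= 0xDFFF
                    and cp <= 0x10FFFF):
                out.append(chr(cp))
            else:
                out.append('\uFFFD')            # one U+FFFD for the whole run
            i = j
        return ''.join(out)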

> and
> reading the PRI text accordingly is quite natural too.

So you are claiming that you're covered because you produce an FFFD 
"where the converter recognizes that the code units collected so far 
constitute an ill-formed subsequence", except that your converter is a 
bit slow in doing that recognition?

Well, I guess I could come up with another converter that would be even 
slower at recognizing that the code units collected so far constitute an 
ill-formed subsequence. Would that still be okay in your view?

And please note that your "just a bit slow" interpretation might somehow 
work for Unicode 5.2, but it doesn't work for Unicode 9.0, because over 
the years, things have been tightened up, and the standard now makes it 
perfectly clear that C0 by itself is a maximal subpart of an ill-formed 
subsequence. From Section 3.9 of 
http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf:

 >>>>
Applying the definition of maximal subparts
for these ill-formed subsequences, in the first case <C0> is a maximal
subpart, because that byte value can never be the first byte of a 
well-formed UTF-8 sequence.
 >>>>
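
In other words, under the current definition a converter stops as soon as
the bytes collected so far can no longer begin a well-formed sequence,
emits one U+FFFD for that maximal subpart, and resynchronizes at the next
byte. A minimal sketch of that behaviour, using the well-formed byte
ranges of Table 3-7 (Python again, purely illustrative):

    # Lead bytes whose first trail byte is restricted to a narrower range
    # (Table 3-7); all other trail bytes must be in 80..BF.
    FIRST_TRAIL = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
                   0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}

    def decode_maximal_subpart(data):
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                out.append(chr(b)); i += 1; continue
            if 0xC2 <= b <= 0xDF:   need = 1
            elif 0xE0 <= b <= 0xEF: need = 2
            elif 0xF0 <= b <= 0xF4: need = 3
            else:
                # 80..C1 and F5..FF can never start a well-formed sequence,
                # so such a byte is a maximal subpart all by itself (e.g. <C0>).
                out.append('\uFFFD'); i += 1; continue
            cp, j, ok = b & (0x7F >> (need + 1)), i + 1, True
            for k in range(1, need + 1):
                lo, hi = FIRST_TRAIL.get(b, (0x80, 0xBF)) if k == 1 else (0x80, 0xBF)
                if j >= len(data) or not lo <= data[j] <= hi:
                    ok = False; break           # bail out as soon as it is ill-formed
                cp = (cp << 6) | (data[j] & 0x3F); j += 1
            out.append(chr(cp) if ok else '\uFFFD')
            i = j                               # resume right after the subpart
        return ''.join(out)

With this, decode_maximal_subpart(b"\xC0\x80") yields two U+FFFDs, because
<C0> and <80> are each a maximal subpart on their own, whereas the
lead-byte-driven sketch above yields a single U+FFFD for the pair. That
difference is exactly what the proposal would change.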


> Aside from UTF-8 history, there is a reason for preferring a more
> "structural" definition for UTF-8 over one purely along valid sequences.

There may be all kinds of reasons for doing things one way or another. 
But there are good reasons why the current recommendation is in place, 
and there are even better reasons for not suddenly reversing it to 
something completely different.


> This applies to code that *works* on UTF-8 strings rather than just
> converting them. For UTF-8 *processing* you need to be able to iterate both
> forward and backward, and sometimes you need not collect code points while
> skipping over n units in either direction -- but your iteration needs to be
> consistent in all cases. This is easier to implement (especially in fast,
> short, inline code) if you have to look only at how many trail bytes follow
> a lead byte, without having to look whether the first trail byte is in a
> certain range for some specific lead bytes.
> 
> (And don't say that everyone can validate all strings once and then all
> code can assume they are valid: That just does not work for library code,
> you cannot assume anything about your input strings, and you cannot crash
> when they are ill-formed.)
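
The kind of iteration described in the quoted paragraph can be sketched
like this (Python, purely illustrative; real code would also cap the
number of trail bytes at what the lead byte announces):

    def forward_one(data, i):
        # Step forward over one run: a lead (or stray) byte plus any trail
        # bytes that follow it.  Only the lead/trail structure is consulted,
        # never lead-byte-specific trail ranges.
        i += 1
        while i < len(data) and 0x80 <= data[i] <= 0xBF:
            i += 1
        return i

    def backward_one(data, i):
        # Step backward symmetrically: back over trail bytes to the byte
        # that starts the run, so both directions agree on the boundaries.
        i -= 1
        while i > 0 and 0x80 <= data[i] <= 0xBF:
            i -= 1
        return i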

[rest of mail mostly OT]

Well, different libraries may make different choices. As an example, the 
Ruby programming language does essentially that: Whenever it finds an 
invalid string, it raises an exception.

Not every operation on an invalid string raises an exception immediately 
(for efficiency reasons), but the strong expectation is that it does so 
soon. As an example, when I 
extended case conversion from ASCII only to Unicode (see e.g. 
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, 
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/), I had to go back 
and fix some things because there were explicit tests checking that 
invalid inputs would raise exceptions.

At least for Ruby, this policy of catching problems early rather than 
allowing garbage-in-garbage-out has worked well.
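
By way of analogy only (the paragraph above is about Ruby, not Python):
Python's default strict decoder shows the same fail-early shape.

    try:
        b"\xC0\x80".decode("utf-8")            # strict error handling by default
    except UnicodeDecodeError as err:
        print("invalid byte sequence:", err)   # raised immediately, not later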


> markus

Regards,   Martin.

