Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Markus Scherer via Unicode unicode at unicode.org
Fri May 26 10:41:32 CDT 2017


On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp>
wrote:

> But there's plenty in the text that makes it absolutely clear that some
> things cannot be included. In particular, it says
>
> >>>>
> The term “maximal subpart of an ill-formed subsequence” refers to the code
> units that were collected in this manner. They could be the start of a
> well-formed sequence, except that the sequence lacks the proper
> continuation. Alternatively, the converter may have found a continuation
> code unit, which cannot be the start of a well-formed sequence.
> >>>>
>
> And the "in this manner" refers to:
> >>>>
> A sequence of code units will be processed up to the point where the
> sequence either can be unambiguously interpreted as a particular Unicode
> code point or where the converter recognizes that the code units collected
> so far constitute an ill-formed subsequence.
> >>>>
>
> So we have the same thing twice: Bail out as soon as something is
> ill-formed.


The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and only at the end checks
whether the assembled code point is too low for the lead byte (i.e.,
overlong), is a surrogate, or is above 10FFFF. Stopping at a non-trail byte
is quite natural, and reading the PRI text accordingly is quite natural too.
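
To make this concrete, here is a minimal sketch of that collect-then-validate
approach -- not ICU's actual code, restricted to lead bytes up to F7 for
brevity, and with an illustrative function name and signature. The lead byte
alone decides how many trail bytes to collect, collection stops at the first
non-trail byte, and only the assembled code point is checked afterwards.
Under this reading, something like <F4 90 80 80> is consumed as one unit and
yields a single error.

    #include <stdint.h>

    /* Sketch of "structural" UTF-8 decoding: collect by lead byte, validate
     * last. Returns the code point, or -1 for an ill-formed subsequence;
     * *ps is advanced past all bytes consumed. Assumes *ps < limit. */
    static int32_t decodeStructural(const uint8_t **ps, const uint8_t *limit) {
        const uint8_t *s = *ps;
        uint8_t b = *s++;
        if (b < 0x80) { *ps = s; return b; }              /* ASCII */

        int trail;                /* trail bytes announced by the lead byte */
        int32_t cp;
        if (b < 0xC0)        { *ps = s; return -1; }      /* lone trail byte 80..BF */
        else if (b <= 0xDF)  { trail = 1; cp = b & 0x1F; }
        else if (b <= 0xEF)  { trail = 2; cp = b & 0x0F; }
        else if (b <= 0xF7)  { trail = 3; cp = b & 0x07; }
        else                 { *ps = s; return -1; }      /* F8..FF */

        for (int i = 0; i < trail; ++i) {
            if (s == limit || (*s & 0xC0) != 0x80) {      /* stop at the first non-trail byte */
                *ps = s;
                return -1;
            }
            cp = (cp << 6) | (*s++ & 0x3F);
        }
        *ps = s;

        /* Only now look at the assembled code point. */
        static const int32_t minByTrail[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < minByTrail[trail] ||                     /* too low for the lead byte */
            (cp >= 0xD800 && cp <= 0xDFFF) ||             /* surrogate */
            cp > 0x10FFFF) {                              /* above 10FFFF */
            return -1;
        }
        return cp;
    }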

Aside from UTF-8 history, there is a reason to prefer a more "structural"
definition of UTF-8 over one based purely on valid sequences. It applies to
code that *works* on UTF-8 strings rather than just converting them. For
UTF-8 *processing* you need to be able to iterate both forward and backward,
and sometimes you need to skip over n units in either direction without
collecting code points -- but your iteration needs to be consistent in all
cases. This is easier to implement (especially in fast, short, inline code)
if you only have to look at how many trail bytes follow a lead byte, without
also having to check whether the first trail byte is in a certain range for
some specific lead bytes.
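
As an illustration, here is a minimal sketch of the kind of forward/backward
skipping this makes easy -- the helper names are hypothetical, not the actual
ICU macros. Both directions look only at lead-byte structure (how many trail
bytes a lead byte announces), so they agree on unit boundaries even over
ill-formed input.

    #include <stdint.h>

    /* Trail bytes announced by a lead byte, 0 for anything else. */
    static int trailCount(uint8_t b) {
        if (b >= 0xC0 && b <= 0xDF) return 1;
        if (b >= 0xE0 && b <= 0xEF) return 2;
        if (b >= 0xF0 && b <= 0xF7) return 3;
        return 0;
    }

    /* Advance past one unit without decoding it: consume the lead byte and
     * at most the number of trail bytes it announces, stopping early at a
     * non-trail byte or at the end of the string. */
    static const uint8_t *skipForward(const uint8_t *s, const uint8_t *limit) {
        int n = trailCount(*s++);
        while (n-- > 0 && s < limit && (*s & 0xC0) == 0x80) {
            ++s;
        }
        return s;
    }

    /* Step back before one unit without decoding it: back up over at most
     * three trail bytes, then keep that position only if the preceding
     * byte is a lead byte that accounts for them. */
    static const uint8_t *skipBackward(const uint8_t *s, const uint8_t *start) {
        const uint8_t *p = s;
        int n = 0;
        while (p > start && (*(p - 1) & 0xC0) == 0x80 && n < 3) {
            --p;
            ++n;
        }
        if (p > start && trailCount(*(p - 1)) >= n) {
            return p - 1;          /* the lead byte claims the trail bytes we crossed */
        }
        return s - 1;              /* otherwise the last byte is a unit by itself */
    }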

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: that just does not work for library code.
You cannot assume anything about your input strings, and you must not crash
when they are ill-formed.)

markus