Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Doug Ewell via Unicode unicode at unicode.org
Wed May 17 20:48:59 CDT 2017


Richard Wordingham wrote:

>> I'm afraid I don't get the analogy.
>
> You can't build a full Unicode system out of Unicode-compliant parts.

Others will have to address Richard's point about canonical-equivalent 
sequences.

> However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
> (in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
> critical wording, "When converting from UTF-8 to Unicode values,
> however, implementations do not need to check that the shortest
> encoding is being used,...". There was no prohibition on
> implementations performing the check, so whether C0 80 would be
> interpreted as U+0000 or as an error was unpredictable.

So it is as I said, and as The Unicode Standard (TUS) said before 
Corrigendum #1 was approved, more than 16 years ago: It was not legal to 
create overlong sequences, but implementations were allowed to interpret 
any that they came across.

As someone who pays attention to the fine details, you will certainly 
appreciate the difference between "it was once legal to encode NUL as E0 
80 80" and "it was once legal for a decoder to interpret the sequence E0 
80 80 as NUL instead of rejecting it."
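
To make that distinction concrete, here is a minimal sketch in Python. 
The function name decode_first and its check_shortest flag are 
hypothetical, invented only for this illustration; they are not taken 
from TUS or from any real library. With check_shortest=False the decoder 
behaves like the lenient decoders that the Unicode 2.0 wording 
permitted; with check_shortest=True it rejects overlong forms, as 
decoders have been required to do since Corrigendum #1. Note that no 
conformant encoder ever produces these bytes in either case.

```python
def decode_first(data: bytes, check_shortest: bool) -> int:
    """Decode the single UTF-8 sequence at the start of `data`.

    Illustrative sketch only: it does not reject surrogates or
    values above U+10FFFF, which a real decoder must also do.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        return lead  # single-byte (ASCII) form
    # Pick the sequence length, the payload bits of the lead byte, and
    # the smallest code point that legitimately needs this length.
    if 0xC0 <= lead <= 0xDF:
        length, value, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:
        length, value, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead <= 0xF7:
        length, value, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    # The shortest-form check is the only difference between the
    # lenient (pre-corrigendum) and strict behaviors shown here.
    if check_shortest and value < minimum:
        raise ValueError("overlong (non-shortest) sequence")
    return value


# The only legal encoding of NUL is the single byte 00.
assert "\x00".encode("utf-8") == b"\x00"

# A lenient decoder may still interpret overlong forms as NUL.
assert decode_first(b"\xC0\x80", check_shortest=False) == 0
assert decode_first(b"\xE0\x80\x80", check_shortest=False) == 0
```

Calling decode_first(b"\xE0\x80\x80", check_shortest=True) raises 
ValueError instead. Python's built-in bytes.decode("utf-8") is likewise 
strict and rejects both C0 80 and E0 80 80.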

--
Doug Ewell | Thornton, CO, US | ewellic.org 


