Running out of code points, redux (was: Re: Feedback on the proposal...)

Thu Jun 1 23:52:11 CDT 2017

On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:
> TUS Section 3 is like the Augean Stables.  It is a complete mess as a
> standards document,

That is a matter of editorial taste, I suppose.

> imputing mental states to computing processes.

That, however, is false. The rhetorical turn in the Unicode Standard's 
conformance clauses, "A process shall interpret..." and "A process shall 
not interpret..." has been in the standard for 21 years, and seems to 
have done its general job in guiding interoperable, conformant 
implementations fairly well. And everyone -- well, perhaps almost 
everyone -- has been able to figure out that such wording is a shorthand 
for something along the lines of "Any person implementing software 
conforming to the Unicode Standard in which a process does X shall 
implement it in such a way that that process when doing X shall follow 
the specification part Y, relevant to doing X, exactly according to that 
specification of Y...", rather than a misguided assumption that software 
processes are cognitive agents equipped with mental states that the 
standard can "tell what to think".

And I contend that the shorthand works just fine.

>
> Table 3-7 for example, should be a consequence of a 'definition' that
> UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
> forms'.

Well, Definition D92 does already explicitly limit UTF-8 to Unicode 
scalar values, and explicitly limits the form to sequences of one to 
four bytes. The reason it doesn't explicitly include the exclusion 
of "non-shortest form" in the definition, but instead refers to Table 
3-7 for the well-formed sequences (which, btw, explicitly rule out all 
the non-shortest forms), is that doing so would create another 
terminological conundrum -- trying to specify an air-tight definition of 
"non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is 
terminologically cleaner to let people *derive* non-shortest form from 
the explicit exclusions of Table 3-7.
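
To illustrate how the exclusions fall out of the table itself, here is a 
minimal Python sketch of a table-driven well-formedness check. The byte 
ranges are transcribed from Table 3-7; the names WELL_FORMED, 
matches_row and is_well_formed_utf8 are hypothetical, made up for this 
illustration, and nothing here is text from the standard.

```python
# Byte ranges for well-formed UTF-8, transcribed from Table 3-7.
# Each row lists the allowed (low, high) range for each byte of a sequence.
# WELL_FORMED and the function names below are illustrative only.
WELL_FORMED = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def matches_row(data: bytes, pos: int, row) -> bool:
    """True if the bytes starting at pos fit every range in this row."""
    if pos + len(row) > len(data):
        return False  # truncated sequence
    return all(lo <= data[pos + i] <= hi for i, (lo, hi) in enumerate(row))


def is_well_formed_utf8(data: bytes) -> bool:
    """Check data against the table, one sequence at a time."""
    pos = 0
    while pos < len(data):
        for row in WELL_FORMED:
            if matches_row(data, pos, row):
                pos += len(row)
                break
        else:
            return False  # no row matches: ill-formed
    return True
```

With this sketch, is_well_formed_utf8(b"\xc0\x80") and 
is_well_formed_utf8(b"\xe0\x80\x80") both return False without any 
separate notion of "non-shortest form": those sequences simply match no 
row of the table.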

> Instead, the exclusion of the sequence <ED A0 80> is presented
> as a brute definition, rather than as a consequence of 0xD800 not being
> a Unicode scalar value. Likewise, 0xFC fails to be legal because it
> would define either a 'non-shortest form' or a value that is not a
> Unicode scalar value.

Actually, 0xFC fails quite simply and unambiguously, because it is not in 
Table 3-7. End of story.

Same for 0xFF. There is nothing architecturally special about 
0xF5..0xFF. All are simply and unambiguously excluded from any 
well-formed UTF-8 byte sequence.
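
As a rough cross-check, the following Python sketch encodes every 
Unicode scalar value and collects the byte values that never occur in 
the output. It assumes that Python's built-in "utf-8" codec emits only 
well-formed UTF-8; the function name never_used_bytes is hypothetical. 
Note that the result also includes 0xC0 and 0xC1, which are likewise 
absent from Table 3-7.

```python
def never_used_bytes():
    """Return the byte values that appear in no well-formed UTF-8 sequence.

    Brute force: encode every Unicode scalar value and record the bytes seen.
    Relies on Python's built-in encoder producing well-formed UTF-8.
    """
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not Unicode scalar values
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


if __name__ == "__main__":
    print(" ".join(f"{b:02X}" for b in never_used_bytes()))
    # prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```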

>
> The differences are a matter of presentation; the outcome as to what is
> permitted is the same.  The difference lies rather in whether the rules
> are comprehensible.  A comprehensible definition is more likely to be
> implemented correctly.  Where the presentation makes a difference is in
> how malformed sequences are naturally handled.

Well, I don't think implementers have all that much trouble figuring out 
what *well-formed* UTF-8 is these days.

As for "how malformed sequences are naturally handled", I can't really 
say. Nor do I think the standard actually requires any particular 
handling to be conformant. It says thou shalt not emit them, and if you 
encounter them, thou shalt not interpret them as Unicode characters. 
Beyond that, it would be nice, of course, if people converged their 
error handling for malformed sequences in cooperative ways, but there is 
no conformance statement to that effect in the standard.
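
For illustration only, here is a short Python sketch of two different 
ways a decoder can handle an ill-formed sequence without ever 
interpreting it as a character: refuse the input outright, or 
substitute U+FFFD. It uses Python's built-in "utf-8" codec with its 
"strict" and "replace" error handlers; the wrapper names 
decode_or_reject and decode_with_replacement are hypothetical. Neither 
choice is presented here as the required or preferred one.

```python
def decode_or_reject(data: bytes):
    """Return the decoded string, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # refuse to interpret ill-formed input at all


def decode_with_replacement(data: bytes) -> str:
    """Decode, substituting U+FFFD wherever the input is ill-formed."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = b"a\xfcb"  # 0xFC cannot occur in well-formed UTF-8
    print(decode_or_reject(sample))               # prints: None
    print(ascii(decode_with_replacement(sample)))  # prints: 'a\ufffdb'
```

How many U+FFFD a decoder emits for a longer ill-formed run, such as 
<ED A0 80>, is exactly the kind of detail on which implementations can 
differ, so the sketch makes no claim about it.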

I have no trouble with the contention that the wording about "best 
practice" and "recommendations" regarding the handling of U+FFFD has 
caused some confusion and differences of interpretation among 
implementers. I'm sure the language in that area could use cleanup, 
precisely because it has led to contending, incompatible interpretations 
of the text. As to what actually *is* best practice in the use of U+FFFD 
when attempting to convert ill-formed sequences handed off to UTF-8 
conversion processes, or whether the Unicode Standard should attempt to 
narrow down or change practice in that area, I am completely agnostic. 
Back to the U+FFFD thread for that discussion.

--Ken