Running out of code points, redux (was: Re: Feedback on the proposal...)
Ken Whistler via Unicode
unicode at unicode.org
Thu Jun 1 23:52:11 CDT 2017
On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:
> TUS Section 3 is like the Augean Stables. It is a complete mess as a
> standards document,
That is a matter of editorial taste, I suppose.
> imputing mental states to computing processes.
That, however, is false. The rhetorical turn in the Unicode Standard's
conformance clauses, "A process shall interpret..." and "A process shall
not interpret..." has been in the standard for 21 years, and seems to
have done its general job in guiding interoperable, conformant
implementations fairly well. And everyone -- well, perhaps almost
everyone -- has been able to figure out that such wording is a shorthand
for something along the lines of "Any person implementing software
conforming to the Unicode Standard in which a process does X shall
implement it in such a way that that process when doing X shall follow
the specification part Y, relevant to doing X, exactly according to that
specification of Y...", rather than a misguided assumption that software
processes are cognitive agents equipped with mental states that the
standard can "tell what to think".
And I contend that the shorthand works just fine.
> Table 3-7 for example, should be a consequence of a 'definition' that
> UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
Well, Definition D92 does already explicitly limit UTF-8 to Unicode
scalar values, and explicitly limits the form to sequences of one to
four bytes. The reason it doesn't explicitly include the exclusion
of "non-shortest form" in the definition, but instead refers to Table
3-7 for the well-formed sequences (which, btw, explicitly rule out all
the non-shortest forms), is that doing so would create another
terminological conundrum -- trying to specify an air-tight definition of
"non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is
terminologically cleaner to let people *derive* non-shortest form from
the explicit exclusions of Table 3-7.
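
To make the "derive it from the table" point concrete, here is a minimal
sketch in Python that treats the rows of Table 3-7 as data and checks a byte
string against them. The names TABLE_3_7, match_row and is_well_formed_utf8
are made up for this illustration and are not part of the standard; the
byte ranges are those of Table 3-7.

```python
# Each row of Table 3-7: one (low, high) range per byte of the sequence.
TABLE_3_7 = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def match_row(data, pos):
    """Return the length of the table row matching at pos, or 0 if none does."""
    for row in TABLE_3_7:
        chunk = data[pos:pos + len(row)]
        if len(chunk) == len(row) and all(
            lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
        ):
            return len(row)
    return 0


def is_well_formed_utf8(data):
    """True if data is a concatenation of sequences listed in the table."""
    pos = 0
    while pos < len(data):
        n = match_row(data, pos)
        if n == 0:
            return False
        pos += n
    return True


print(is_well_formed_utf8(b"\xed\x9f\xbf"))  # U+D7FF
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # would be 0xD800
print(is_well_formed_utf8(b"\xc0\x80"))      # non-shortest form of U+0000
```

Nothing in the sketch mentions "non-shortest form" or surrogates; both
exclusions fall out of the ranges in the table.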
> Instead, the exclusion of the sequence <ED A0 80> is presented
> as a brute definition, rather than as a consequence of 0xD800 not being
> a Unicode scalar value. Likewise, 0xFC fails to be legal because it
> would define either a 'non-shortest form' or a value that is not a
> Unicode scalar value.
Actually 0xFC fails quite simply and unambiguously, because it is not in
Table 3-7. End of story.
Same for 0xFF. There is nothing architecturally special about
0xF5..0xFF. All are simply and unambiguously excluded from any
well-formed UTF-8 byte sequence.
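
As a quick check of that claim, the following Python sketch lists every byte
value that appears nowhere in Table 3-7, in either a leading or a trailing
position. LEAD_RANGES, TRAIL_RANGE and can_appear are names invented for
this example; the ranges are the first-byte column of the table collapsed,
plus 80..BF, which covers every trailing-byte range.

```python
# First-byte ranges of Table 3-7, with adjacent rows merged.
LEAD_RANGES = [(0x00, 0x7F), (0xC2, 0xDF), (0xE0, 0xEF), (0xF0, 0xF4)]
# Every trailing-byte range in the table lies inside 80..BF.
TRAIL_RANGE = (0x80, 0xBF)


def can_appear(b):
    """True if byte value b occurs somewhere in some well-formed sequence."""
    return any(lo <= b <= hi for lo, hi in LEAD_RANGES + [TRAIL_RANGE])


never = [b for b in range(256) if not can_appear(b)]
print(" ".join(f"{b:02X}" for b in never))
# prints: C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```

0xFC sits in that list for the same reason as 0xFF: it is not in the table.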
> The differences are a matter of presentation; the outcome as to what is
> permitted is the same. The difference lies rather in whether the rules
> are comprehensible. A comprehensible definition is more likely to be
> implemented correctly. Where the presentation makes a difference is in
> how malformed sequences are naturally handled.
Well, I don't think implementers have all that much trouble figuring out
what *well-formed* UTF-8 is these days.
As for "how malformed sequences are naturally handled", I can't really
say. Nor do I think the standard actually requires any particular
handling to be conformant. It says thou shalt not emit them, and if you
encounter them, thou shalt not interpret them as Unicode characters.
Beyond that, it would be nice, of course, if people converged their
error handling for malformed sequences in cooperative ways, but there is
no conformance statement to that effect in the standard.
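
By way of illustration only, here is a small Python sketch of two handling
strategies that both stay within that requirement, using the "strict" and
"replace" error handlers of Python's built-in UTF-8 codec. The helper names
strict_decode and lenient_decode are invented for the example, and how many
U+FFFD a given decoder emits for a given ill-formed sequence is an
implementation choice, not something the sketch takes a position on.

```python
def strict_decode(data):
    """Return the decoded text, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


def lenient_decode(data):
    """Decode, substituting U+FFFD for whatever the codec finds ill-formed."""
    return data.decode("utf-8", errors="replace")


sample = b"A\xed\xa0\x80B"  # <ED A0 80> is not in Table 3-7

print(strict_decode(sample))          # rejected outright
print(ascii(lenient_decode(sample)))  # "A", one or more U+FFFD, then "B"
```

Either way, the ill-formed bytes are never interpreted as a character (in
particular, not as 0xD800).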
I have no trouble with the contention that the wording about "best
practice" and "recommendations" regarding the handling of U+FFFD has
caused some confusion and differences of interpretation among
implementers. I'm sure the language in that area could use cleanup,
precisely because it has led to contending, incompatible interpretations
of the text. As to what actually *is* best practice in use of U+FFFD
when attempting to convert ill-formed sequences handed off to UTF-8
conversion processes, or whether the Unicode Standard should attempt to
narrow down or change practice in that area, I am completely agnostic.
Back to the U+FFFD thread for that discussion.