Corrigendum #9

Wed Jul 2 10:02:56 CDT 2014

On 06/12/2014 11:14 PM, Peter Constable wrote:
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson
> Sent: Wednesday, June 11, 2014 9:30 PM
>
>> I have something like a library that was written a long time ago
>> (not by me) assuming that noncharacters were illegal in open interchange.
>> Programs that use the library were guaranteed that they would not receive
>> noncharacters in their input.
>
> I haven't read every post in the thread, so forgive me if I'm making incorrect inferences.
>
> I get the impression that you think that Unicode conformance requirements have historically provided that guarantee, and that Corrigendum #9 broke that. If so, then that is a mistaken understanding of Unicode conformance.

Any real-world application dealing with Unicode inputs needs to be 
protected from "bad" inputs.  These can come in the form of malicious 
attacks, or the result of a noisy transmission, or just plain mistakes.
It doesn't matter.  Generally, a gatekeeper application is employed to
furnish this protection, so that the other application doesn't have to 
keep checking things at every turn.  And, since software is expensive to 
write and prone to error, a generic gatekeeper is usually used, shared 
among many applications.  Such a gatekeeper may very well be 
configurable to let through some inputs that would normally be 
considered bad, to accommodate rare special cases.  In UTF-8, for
example, I'm told that Sun, for reasons I've forgotten or never knew,
did not want raw NUL bytes to appear in text streams, so it used the
overlong sequence \xC0\x80 to represent them, even though overlong
sequences are generally considered "bad" because they could be used to
insert malicious payloads into the input.
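
As a rough illustration of such a configurable gatekeeper, here is a
minimal Python sketch.  The function name decode_gatekeeper and the
flag allow_overlong_nul are hypothetical, made up for this example;
the only real behavior relied on is that Python's strict UTF-8 decoder
rejects overlong sequences such as \xC0\x80.

```python
def decode_gatekeeper(data: bytes, allow_overlong_nul: bool = False) -> str:
    """Decode UTF-8 strictly, optionally accepting the overlong NUL form.

    Hypothetical helper for illustration only.  The byte 0xC0 is never
    valid in well-formed UTF-8, so rewriting the exact pair C0 80 to a
    real NUL cannot alter any otherwise valid sequence.
    """
    if allow_overlong_nul:
        # Opt-in special case: map the overlong NUL to a genuine NUL byte.
        data = data.replace(b"\xc0\x80", b"\x00")
    # Every other overlong or malformed sequence still raises
    # UnicodeDecodeError here.
    return data.decode("utf-8", errors="strict")
```

By default the overlong NUL is rejected along with every other
overlong form; only a caller that explicitly asks for the special case
gets it.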

The original wording of the non-character text, "should never be
interchanged", doesn't necessarily mean that they will never be valid
in input, but rather that their deliberate appearance there would be
quite rare, and that a gatekeeper application should default to not passing
them through.  A protected application could indicate to the gatekeeper
that it is prepared to handle non-character inputs, but the default 
should be to not accept them.
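
The default-reject, opt-in-accept policy described above could be
sketched in Python roughly as follows.  The names is_noncharacter,
gatekeep, and allow_noncharacters are assumptions invented for this
example, not part of any existing library.  The set of noncharacters
tested is the 66 code points the standard defines: U+FDD0..U+FDEF plus
the last two code points of each of the 17 planes.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF at the end of any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def gatekeep(text: str, allow_noncharacters: bool = False) -> str:
    """Pass text through unchanged, rejecting noncharacters by default.

    Hypothetical gatekeeper: a protected application that is prepared
    to handle noncharacters opts in with allow_noncharacters=True.
    """
    if not allow_noncharacters:
        for index, ch in enumerate(text):
            if is_noncharacter(ord(ch)):
                raise ValueError(
                    f"noncharacter U+{ord(ch):04X} at index {index}"
                )
    return text
```

Note that the sketch rejects the input rather than deleting the
offending code points, so text that is passed through is never
silently modified.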

Corrigendum #9 has changed this so much that people are coming to me and 
saying that inputs may very well have non-characters, and that the 
default should be to pass them through.  Since we have no published 
wording for how the TUS will absorb Corrigendum #9, I don't know how 
this will play out.  But this abrupt a change seems wrong to me, and it 
was done without public input or really adequate time to consider its 
effects.

Non-characters are still designed solely for internal use, and hence I 
think the default for a gatekeeper should still be to exclude them.

>
> Here is what has historically been said in the way of conformance requirements related to non-characters:
>
> TUS 1.0: There were no conformance requirements stated. This recommendation was given:
> "U+FFFF and U+FFFE are reserved and should not be transmitted or stored."
>
> This same recommendation was repeated in later versions. However, it must be recognized that "should" statements are never absolute requirements.
>
> Conformance requirements first appeared in TUS 2.0:
>
> TUS 2.0, TUS 3.0:
> "C5	A process shall not interpret either U+FFFE or U+FFFF as an abstract character."
>
>
> TUS 4.0:
> "C5	A process shall not interpret a noncharacter code point as an abstract character."
>
> "C10	When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points."
>
> Btw, note that C10 makes the assumption that a valid coded character sequence can include non-character code points.
>
>
> TUS 5.0 (trivially different from TUS4.0):
> C2 = TUS4.0, C5
>
> "C7	When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points."
>
>
> TUS 6.0:
> C2 = TUS5.0, C2
>
> "C7	When a process purports not to modify the interpretation of a valid coded character
> sequence, it shall make no change to that coded character sequence other than the possible
> replacement of character sequences by their canonical-equivalent sequences."
>
> Interestingly, the change to C7 does not permit non-characters to be replaced or removed at all while claiming to have left the interpretation intact.
>
>
> So, there was a change in 6.0 that could impact conformance claims of existing implementations. But there have never been any guarantees made _by Unicode_ that non-character code points will never occur in open interchange. Interchange has always been discouraged, but never prohibited.
>
>
>
>
> Peter
>