Corrigendum #9

Karl Williamson public at khwilliamson.com
Fri May 30 13:26:18 CDT 2014


I'm having a problem with Corrigendum #9:
http://www.unicode.org/versions/corrigendum9.html

Some people now think it means that noncharacters are really no 
different from private-use characters, and should be treated very 
similarly if not identically.

It seems to me that they should be illegal in open interchange, or 
perhaps illegal in interchange without prior agreement.

Any system (a process, or a group of related, cooperating processes) 
that uses noncharacters internally will not want any of the ones it uses 
to appear in its inputs.  It will want to filter them out of those 
inputs, likely turning each into a REPLACEMENT CHARACTER.  If it fails 
to do that, it leaves itself vulnerable to attackers, who can fool it 
into thinking the input data is different from what it really is.
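
The input filtering described above could look roughly like the 
following Python sketch.  The names is_noncharacter and 
filter_noncharacters are illustrative assumptions, not taken from any 
particular library, and a real system might filter only the 
noncharacters it actually uses internally rather than all 66.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is one of the 66 Unicode noncharacters."""
    # The contiguous run of 32 noncharacters, U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def filter_noncharacters(text: str) -> str:
    """Replace every noncharacter in text with U+FFFD REPLACEMENT CHARACTER."""
    return "".join(
        REPLACEMENT_CHARACTER if is_noncharacter(ord(ch)) else ch
        for ch in text
    )
```

A system would apply filter_noncharacters to untrusted input before 
any stage that gives its own noncharacters a special internal meaning.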

Hence, a system that creates outputs containing noncharacters cannot be 
assured that any other system will accept those noncharacters.

Thus, I don't see how noncharacters can be considered to be valid in 
public interchange, given that the producers have to assume that the 
consumers will not accept them.  Producers can assume that consumers 
will accept private-use characters, though the consumers may not know 
what the producers intended them to mean.

I think the text in TUS 6.2, section 16.7, is good and does not need to 
be changed: "Noncharacters ... are forbidden for use in open interchange 
of Unicode text data".

Perhaps slightly better wording would be, "are forbidden for use in 
interchange of Unicode text data without prior agreement".

The only reason I can think of for your backing away (too far, in my 
opinion) from what TUS has said about noncharacters since their 
inception is to accommodate processes that conform to C7, that is, any 
process "that purports to not modify the interpretation of a valid coded 
character sequence".  But I think there is a better way to do that than 
what Corrigendum #9 currently says.

I also am curious as to why the consecutive group of 32 noncharacters 
can't be split off into its own block instead of being part of an Arabic 
one.  I'm unaware of any stability policy forbidding this.  Another 
block is to be split, if I recall correctly, to accommodate the new 
Cherokee characters.

