Corrigendum #9

Asmus Freytag asmusf at ix.netcom.com
Fri May 30 13:49:00 CDT 2014


On 5/30/2014 11:26 AM, Karl Williamson wrote:
> I'm having a problem with this
> http://www.unicode.org/versions/corrigendum9.html

You are not alone.
>
> Some people now think it means that noncharacters are really no 
> different from private-use characters, and should be treated very 
> similarly if not identically.
>
> It seems to me that they should be illegal in open interchange, or 
> perhaps illegal in interchange without prior agreement.
>
> Any system (process or group of related, cooperating processes) that 
> uses noncharacters will want to not have any of the ones it uses 
> present in its inputs.  It will want to filter them out of those 
> inputs, likely turning each into a REPLACEMENT CHARACTER. If it fails 
> to do that, it leaves itself vulnerable to an attack by hackers, who 
> can fool it into thinking the input data is different from what it 
> really is.
>
> Hence, a system that creates outputs containing noncharacters cannot 
> be assured that any other system will accept those noncharacters.
>
> Thus, I don't see how noncharacters can be considered to be valid in 
> public interchange, given that the producers have to assume that the 
> consumers will not accept them.  Producers can assume that consumers 
> will accept private-use characters, though they may not know their 
> intent.

This is an important distinction.

One of the concerns was that people felt that "data pipeline" style 
implementations (tools) had to filter these out, even if the 
implementation had no intent to use them internally in any way. Making 
clear that the standard does not require filtering allows for cleaner 
implementations of such "pass-through" tools.
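
To make the distinction concrete, here is a minimal Python sketch of the two 
behaviours. The function names are mine and purely illustrative; the only 
facts taken from the standard are the noncharacter code points themselves 
(U+FDD0..U+FDEF plus the last two code points of each plane) and 
U+FFFD REPLACEMENT CHARACTER.

```python
REPLACEMENT_CHARACTER = "\uFFFD"

def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # The contiguous run of 32 noncharacters, U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE

def pass_through(text: str) -> str:
    """A pipeline tool with no internal use for noncharacters: leave them alone."""
    return text

def filter_noncharacters(text: str) -> str:
    """A system that uses noncharacters internally: sanitize untrusted input."""
    return "".join(
        REPLACEMENT_CHARACTER if is_noncharacter(ord(ch)) else ch
        for ch in text
    )
```

A system that reserves, say, U+FDD0 as an internal sentinel would call 
filter_noncharacters() on everything it reads; a pass-through tool would 
not need to.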

However, like you, I feel that the corrigendum went too far.
>
> I think the text in 6.2 section 16.7 is good and does not need to be 
> changed: "Noncharacters ... are forbidden for use in open interchange 
> of Unicode text data"
>
> Perhaps a bit better wording would be, "are forbidden for use in 
> interchange of Unicode text data without prior agreement"
>
> The only reason I can think of for your too-large (in my opinion) 
> backing away from what TUS has said about noncharacters since their 
> inception is to accommodate processes that conform to C7, "that 
> purports to not modify the interpretation of a valid coded character 
> sequence". But, I think there is a better way to do that than what 
> Corrigendum #9 currently says.
>
> I also am curious as to why the consecutive group of 32 noncharacters 
> can't be split off into its own block instead of being part of an 
> Arabic one.  I'm unaware of any stability policy forbidding this.  
> Another block is to be split, if I recall correctly, to accommodate 
> the new Cherokee characters.

This might have been possible at the time these were added, but now it 
is probably not feasible. One of the reasons is that block names are 
exposed (for better or for worse) as character properties and as such 
are also exposed in regular expressions. While matching on block names 
is not recommended, it would be really bad if an expression such as the 
pseudo-code "IsInArabicPresentationFormsA(x)" were to fail because we 
split the block into three (with the middle one being the noncharacters).
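
A rough Python sketch of the breakage, assuming a block test implemented as 
a plain range check. The current range of Arabic Presentation Forms-A 
(U+FB50..U+FDFF, which contains the 32 noncharacters) is real; the 
three-way split and the names of the two new blocks are hypothetical and 
exist only for illustration.

```python
# Arabic Presentation Forms-A as currently defined: U+FB50..U+FDFF.
ARABIC_PF_A = range(0xFB50, 0xFE00)

def in_arabic_pf_a(ch: str) -> bool:
    """Block test against the current block boundaries."""
    return ord(ch) in ARABIC_PF_A

# HYPOTHETICAL split of the same range into three blocks.
# The second and third block names are invented for this sketch.
HYPOTHETICAL_BLOCKS = {
    "Arabic Presentation Forms-A": range(0xFB50, 0xFDD0),
    "Hypothetical Noncharacter Block": range(0xFDD0, 0xFDF0),
    "Hypothetical Remainder Block": range(0xFDF0, 0xFE00),
}

def in_arabic_pf_a_after_split(ch: str) -> bool:
    """The same block test, evaluated against the hypothetical split."""
    return ord(ch) in HYPOTHETICAL_BLOCKS["Arabic Presentation Forms-A"]

# U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM sits after the noncharacters.
ligature = "\uFDF2"
print(in_arabic_pf_a(ligature))              # matches today
print(in_arabic_pf_a_after_split(ligature))  # silently stops matching
```

Existing expressions would keep running without error after such a split; 
they would simply stop matching the assigned characters at U+FDF0 and above.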

It's the usual dance: is it better to prevent such breakage, or is it 
better to not pile up more "exceptions" like noncharacters being filed 
under Arabic Presentation Forms? The damage from the former is direct 
and immediate and eventually decays. The damage from the latter is 
subtle and cumulative over time.

Tough choice.

A./