Corrigendum #9

Wed Jun 11 23:29:53 CDT 2014

On 06/02/2014 09:48 AM, Markus Scherer wrote:
> On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <doug at ewellic.org
> <mailto:doug at ewellic.org>> wrote:
>
>     I suspect everyone can agree on the edge cases, that noncharacters are
>     harmless in internal processing, but probably should not appear in
>     random text shipped around on the web.
>
>
> Right, in principle. However, it should be ok to include noncharacters
> in CLDR data files for processing by CLDR implementations, and it should
> be possible to edit and diff and version-control and web-view those
> files etc.
>
> It seems that trying to define "interchange" and "public" in ways that
> satisfy everyone will not be successful.
>
> The FAQ already gives some examples of where noncharacters might be
> used, should be preserved, or could be stripped, starting with "Q: Are
> noncharacters intended for interchange?
> <http://www.unicode.org/faq/private_use.html#nonchar6>"
>
> In my view, those Q/A pairs explain noncharacters quite well. If there
> are further examples of where noncharacters might be used, should be
> preserved, or could be stripped, and that would be particularly useful
> to add to the examples already there, then we could add them.
>
> markus
>
>

I was unaware of this FAQ.  Having read it and re-read this entire 
thread, I am still troubled.

I have something like a library that was written a long time ago (not
by me) on the assumption that noncharacters were illegal in open interchange.
Programs that use the library were guaranteed that they would not 
receive noncharacters in their input.  They thus are free to use any 
noncharacter internally as they wish.  Now that Corrigendum #9 has come 
out, I'm getting requests to update the library to not reject 
noncharacters.  The library itself does not use noncharacters.  If I (or
someone else) make the requested change, it may silently open security
holes in programs that depend on the library doing the rejection and
that upgrade to the new version.  Some of these programs may have
been written many years ago.  The original authors are now dead in some 
instances, or have turned the code over to someone else, or haven't 
thought about it in years.  The current maintainers of those programs 
may be unaware of this dependence, and hence may upgrade without 
realizing the consequences.  Further, the old versions of the library 
will soon be unsupported, so there is pressure to upgrade to get bug 
fixes and the promise of future support.  This means there could be 
security holes that a hacker who gets a hold of the source can exploit.
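
To make the dependency concrete, here is a minimal sketch in Python.  It
is not the library in question.  `strict_decode` and `lenient_decode` are
hypothetical stand-ins for the behavior before and after the requested
change, and the client's use of U+FFFF as a field separator is an assumed
example of the kind of internal use described above.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+FFFE and U+FFFF, plus the last two code points of every other plane.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def strict_decode(data: bytes) -> str:
    """Hypothetical old behavior: decode UTF-8 and reject noncharacters."""
    text = data.decode("utf-8")
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError(f"noncharacter U+{ord(ch):04X} in input")
    return text


def lenient_decode(data: bytes) -> str:
    """Hypothetical new behavior: noncharacters pass through unchanged."""
    return data.decode("utf-8")


# Hypothetical client code that relies on the old guarantee: it uses
# U+FFFF as an internal field separator, trusting that input never has one.
SEPARATOR = "\uffff"


def pack(fields):
    """Join fields into one internal record."""
    return SEPARATOR.join(fields)


def unpack(record):
    """Split an internal record back into fields."""
    return record.split(SEPARATOR)
```

With `strict_decode`, input containing U+FFFF never reaches `pack`.  With
`lenient_decode`, a single input field containing U+FFFF comes back from
`unpack` as two fields, and the client has no way to notice.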

I don't see anything in the FAQ that really addresses this situation.  I 
think there should be an answer that addresses code written before the 
Corrigendum, and that goes into detail about the security issues. My 
guess is that the UTC did not really consider the potential for security 
holes when making this Corrigendum.

I agree that CLDR should be able to use noncharacters for internal 
processing, and that they should be able to be stored in files and 
edited.  But I believe that version control systems and editors have 
just as much right to use noncharacters for their internal purposes.  I
disagree with the FAQ, which seems to say that if you write a utility you
should avoid using noncharacters in its implementation.  It might be
that competitive pressure, or just the fact that a particular implementation
doesn't need noncharacters, would cause some such utilities to accept
some or all noncharacters as inputs.  But if I were writing such code,
I can now see how using noncharacters for my own purposes would be quite
convenient.  CLDR could be considered to be a utility, and its users
might want to use noncharacters for their purposes.  Is CLDR constructed 
so there is no potential for conflicts here?  That is, does it reserve 
certain noncharacters for its own use?

The FAQ talks about how various now-noncharacter code points were touted 
as sentinel candidates in earlier Unicode versions, and that they are no 
longer so.  But it really should emphasize that old code may very well 
want to continue to use them as sentinels.  The answer "Well, the short 
answer is no, that is not true—at least, not entirely true."  is 
misleading in this regard.

The FAQ mentions using 0x7FFFFFFF as a possible sentinel.  I did not 
realize that that was considered representable in any UTF.  Likewise -1.
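
For reference, under the current definitions the Unicode code space runs
from 0 to 0x10FFFF, so neither 0x7FFFFFFF nor -1 is a code point, and
neither can be encoded in UTF-8, UTF-16, or UTF-32.  Such values can only
act as sentinels in an API that passes code points around as integers.  A
minimal sketch of the distinction, with hypothetical helper names:

```python
MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code space


def is_code_point(value: int) -> bool:
    """True if value lies inside the Unicode code space."""
    return 0 <= value <= MAX_CODE_POINT


def can_appear_in_utf(value: int) -> bool:
    """True if value is a code point that a UTF can encode.

    Surrogates are code points but are excluded from well-formed UTFs.
    Noncharacters such as U+FFFF are encodable and so return True.
    """
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)
```

Here `can_appear_in_utf(0xFFFF)` is true while `can_appear_in_utf(-1)` and
`can_appear_in_utf(0x7FFFFFFF)` are false, which is why the latter two can
never collide with anything that arrives as text.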