public at khwilliamson.com
Wed Jun 11 23:29:53 CDT 2014
On 06/02/2014 09:48 AM, Markus Scherer wrote:
> On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <doug at ewellic.org
> <mailto:doug at ewellic.org>> wrote:
> I suspect everyone can agree on the edge cases, that noncharacters are
> harmless in internal processing, but probably should not appear in
> random text shipped around on the web.
> Right, in principle. However, it should be ok to include noncharacters
> in CLDR data files for processing by CLDR implementations, and it should
> be possible to edit and diff and version-control and web-view those
> files etc.
> It seems that trying to define "interchange" and "public" in ways that
> satisfy everyone will not be successful.
> The FAQ already gives some examples of where noncharacters might be
> used, should be preserved, or could be stripped, starting with "Q: Are
> noncharacters intended for interchange?"
> In my view, those Q/A pairs explain noncharacters quite well. If there
> are further examples of where noncharacters might be used, should be
> preserved, or could be stripped, and that would be particularly useful
> to add to the examples already there, then we could add them.
I was unaware of this FAQ. Having read it and re-read this entire
thread, I am still troubled.
I have something like a library that was written a long time ago (not
by me) assuming that noncharacters were illegal in open interchange.
Programs that use the library were guaranteed that they would not
receive noncharacters in their input. They thus are free to use any
noncharacter internally as they wish. Now that Corrigendum #9 has come
out, I'm getting requests to update the library to not reject
noncharacters. The library itself does not use noncharacters. If I (or
someone else) make the requested change, it may silently cause security
holes in those programs that were depending on it doing the rejection,
and that upgrade to use the new version. Some of these programs may have
been written many years ago. The original authors are now dead in some
instances, or have turned the code over to someone else, or haven't
thought about it in years. The current maintainers of those programs
may be unaware of this dependence, and hence may upgrade without
realizing the consequences. Further, the old versions of the library
will soon be unsupported, so there is pressure to upgrade to get bug
fixes and the promise of future support. This means there could be
security holes that a hacker who gets hold of the source can exploit.
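To make the concern concrete, here is a minimal sketch of the kind of dependence described above. Everything in it is hypothetical: `validate`, its `allow_noncharacters` flag, and the client helpers `join_fields` and `split_fields` are invented for illustration and are not the actual library's API. The only part taken from the standard is the set of 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes).

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def validate(text: str, allow_noncharacters: bool = False) -> str:
    """Hypothetical library entry point.

    With allow_noncharacters=False it behaves like the old library and
    rejects noncharacters; True models the requested new behavior.
    """
    if not allow_noncharacters:
        for ch in text:
            if is_noncharacter(ord(ch)):
                raise ValueError(f"noncharacter U+{ord(ch):04X} in input")
    return text


# Hypothetical client program: it uses U+FFFF as an internal field
# separator, relying on the library to keep U+FFFF out of its input.
SENTINEL = "\uFFFF"


def join_fields(fields, allow_noncharacters=False):
    return SENTINEL.join(validate(f, allow_noncharacters) for f in fields)


def split_fields(packed):
    return packed.split(SENTINEL)
```

Under the old behavior, `join_fields(["a\uFFFFx", "b"])` raises `ValueError`. With `allow_noncharacters=True` the same input round-trips as three fields instead of two, with no error raised, which is the kind of silent change in behavior a maintainer could miss on upgrade.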
I don't see anything in the FAQ that really addresses this situation. I
think there should be an answer that addresses code written before the
Corrigendum, and that goes into detail about the security issues. My
guess is that the UTC did not really consider the potential for security
holes when making this Corrigendum.
I agree that CLDR should be able to use noncharacters for internal
processing, and that they should be able to be stored in files and
edited. But I believe that version control systems and editors have
just as much right to use noncharacters for their internal purposes. I
disagree with the FAQ, which seems to say that if you write a utility you
should avoid using noncharacters in its implementation. It might be
that competitive pressure, or just that the particular implementations
don't need noncharacters, would cause some such utilities to accept
some or all noncharacters as inputs. But if I were writing such code,
I can see now how using noncharacters for my purposes would be quite
convenient. CLDR could be considered to be a utility, and its users
might want to use noncharacters for their purposes. Is CLDR constructed
so there is no potential for conflicts here? That is, does it reserve
certain noncharacters for its own use?
The FAQ talks about how various now-noncharacter code points were touted
as sentinel candidates in earlier Unicode versions, and that they are no
longer so. But it really should emphasize that old code may very well
want to continue to use them as sentinels. The answer "Well, the short
answer is no, that is not true—at least, not entirely true." is
misleading in this regard.
The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
realize that that was considered representable in any UTF. Likewise -1.
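For reference, both values lie outside the Unicode codespace (0 through 0x10FFFF), so neither is a code point. One possible reading of the FAQ, and this is an assumption about its intent, is that such values are meant only as in-memory markers in code that holds code points in ordinary integers, never as something to be encoded. A minimal sketch of that reading, with hypothetical names throughout:

```python
MAX_CODE_POINT = 0x10FFFF

# Hypothetical in-memory end marker. It is not a code point, so it can
# never collide with any character or noncharacter in the input.
END = -1  # 0x7FFFFFFF would serve the same purpose in a 32-bit integer


def is_code_point(value: int) -> bool:
    """True if value lies inside the Unicode codespace."""
    return 0 <= value <= MAX_CODE_POINT


def count_until_end(code_points):
    """Walk a buffer of integers terminated by END and count the entries."""
    buf = list(code_points) + [END]
    i = 0
    while buf[i] != END:
        i += 1
    return i
```

Here `count_until_end([0x41, 0xFFFF, 0x10FFFF])` counts all three entries, including the two noncharacters, because the marker sits outside the range any input code point can take.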