Corrigendum #9

Sat May 31 04:02:34 CDT 2014

On Fri, 30 May 2014 12:26:18 -0600
Karl Williamson <public at khwilliamson.com> wrote:

> I'm having a problem with this
> http://www.unicode.org/versions/corrigendum9.html

> Some people now think it means that noncharacters are really no 
> different from private-use characters, and should be treated very 
> similarly if not identically.

> It seems to me that they should be illegal in open interchange, or 
> perhaps illegal in interchange without prior agreement.

So one just puts a notice on the web site saying that by downloading
CLDR files one agrees to accept noncharacters.  Part of the original
problem is that the CLDR mechanism for identifying Unicode scalar
values in XML, as opposed to quoting them directly (albeit by numeric
character references), was broken.

> Thus, I don't see how noncharacters can be considered to be valid in 
> public interchange, given that the producers have to assume that the 
> consumers will not accept them.

The publishing of the CLDR data was strictly limited to the Milky Way,
and will remain so for several decades at the very least.  Therefore it
was not public interchange.

Practically, there is the very real issue that a system may be useful
enough to be used as part of a larger system, and therefore be called
upon to handle any Unicode scalar value.  One possible solution is to
use lone low surrogates instead of noncharacters.  These have the
advantage of having obvious representations in all three encoding
forms.  Of course, internal checks on the well-formedness of Unicode
strings would have to be relaxed, and one might prefer to use them
doubled in UTF-16 so as not to weaken checks for broken strings.
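
As a rough illustration of the lone-surrogate idea, here is a minimal
Python sketch.  It assumes Python 3.4 or later, whose "surrogatepass"
error handler lets surrogate code points through the UTF-8, UTF-16 and
UTF-32 codecs.  The names encode_all and scan_utf16 are made up for
this example, and the "doubled low surrogate" rule in scan_utf16 is
only one possible reading of the suggestion, not an established
convention.

```python
# Sketch only: a lone low surrogate used as an internal sentinel.
LONE_LOW = "\udc00"  # U+DC00, the first low (trailing) surrogate


def encode_all(s):
    """Encode s in the three encoding forms, letting surrogates through."""
    return {
        "utf-8": s.encode("utf-8", "surrogatepass"),
        "utf-16-le": s.encode("utf-16-le", "surrogatepass"),
        "utf-32-le": s.encode("utf-32-le", "surrogatepass"),
    }


def scan_utf16(units):
    """Scan a list of UTF-16 code units.

    Returns (marker_count, well_formed).  Hypothetical rule: a low
    surrogate immediately repeated counts as one sentinel marker; any
    other unpaired surrogate makes the string broken.
    """
    markers = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return markers, False
        if 0xDC00 <= u <= 0xDFFF:
            # An unpaired low surrogate is accepted only when doubled.
            if i + 1 < n and units[i + 1] == u:
                markers += 1
                i += 2
                continue
            return markers, False
        i += 1
    return markers, True


forms = encode_all(LONE_LOW)
```

The default strict decoders reject all three byte sequences in forms,
which is the sense in which well-formedness checks would have to be
relaxed.  With the doubling rule, scan_utf16 still reports a single
stray low surrogate as broken while accepting the doubled sentinel.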

Richard.