Corrigendum #9

Richard Wordingham richard.wordingham at ntlworld.com
Wed Jun 25 12:58:55 CDT 2014


On Tue, 24 Jun 2014 09:16:00 -0400
CE Whitehead <cewcathar at hotmail.com> wrote:

> ME: if two sequences are canonically equivalent except that one has
> noncharacters in it, are these still canonically equivalent?

Canonical equivalences are defined for all sequences of scalar values;
it is just that they may change from version to version for most
unassigned characters.

Non-characters only decompose to themselves and do not
occur in the canonical (or indeed compatibility) decomposition of
anything else, so a sequence containing a non-character cannot be
canonically equivalent to a sequence not containing a non-character.
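
As a rough illustration of the point above, here is a minimal Python
sketch. It assumes the standard library's unicodedata module is an
acceptable stand-in for the normalization algorithm, and it tests
canonical equivalence by comparing NFD forms. The helper names
is_noncharacter and canonically_equivalent are made up for this example.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    # Hypothetical helper: U+FDD0..U+FDEF plus the last two code points
    # of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ...).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"      # e with acute, precomposed
decomposed = "e\u0301"   # e followed by combining acute
nonchar = "\ufdd0"       # one of the non-characters

# Equivalent without the non-character, and with it in both sequences.
same_plain = canonically_equivalent(composed, decomposed)
same_both = canonically_equivalent(composed + nonchar, decomposed + nonchar)

# Not equivalent when only one sequence contains the non-character.
same_one_sided = canonically_equivalent(composed + nonchar, decomposed)
```

With this sketch, same_plain and same_both come out true and
same_one_sided comes out false, since the non-character survives
normalization unchanged and nothing else decomposes to it.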

> Regarding the sentinels; I am an outsider but assume that with
> Corrigendum 9 U+FFFE will continue to be mentioned as having
> generally (not always?) standard use throughout; in Chapter 16.7 it
> is currently mentioned; I assume it will still be -- according to
> info. in the FAQ and elsewhere:
> http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit
> unsigned hexadecimal value U+FFFE is not a Unicode character value,
> and should be taken as a signal that Unicode characters should be
> byte-swapped before interpretation. U+FFFE should only be interpreted
> as an incorrectly byte-swapped version of U+FEFF" 

There is a lot of untruth in that FAQ entry, alas.  I think U+FFFE
and possibly U+FFFF should be treated differently to the other 64
non-characters.  At present there is no certainty as to whether
an interchanged file in the UTF-16 encoding scheme that appears to
contain a BOM contains a BOM or starts with U+FFFE.  The only
promise is that such a file contains an even number of data bytes.
Any such sequence is valid!  Will the UTF-16 encoding scheme be
withdrawn?
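
To make the ambiguity concrete, here is a small Python sketch, offered
as an illustration only. The four bytes are an arbitrary example, and
the two readings rely on Python's "utf-16" and "utf-16-be" codecs, which
may not match every detail of the encoding scheme as specified.

```python
# Four bytes that begin with FF FE: an even number of data bytes.
data = bytes([0xFF, 0xFE, 0x41, 0x00])

# Reading 1: FF FE is a BOM announcing little-endian order.
# Python's "utf-16" codec consumes the BOM and decodes the rest.
with_bom = data.decode("utf-16")

# Reading 2: there is no BOM and the text is big-endian,
# so the first code unit is the non-character U+FFFE.
without_bom = data.decode("utf-16-be")

print([hex(ord(c)) for c in with_bom])
print([hex(ord(c)) for c in without_bom])
```

The first reading yields the single character U+0041, the second yields
U+FFFE followed by U+4100, and nothing in the bytes themselves says
which was intended.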

Richard.

