UTF-16 Encoding Scheme and U+FFFE

Tue Jun 3 19:23:53 CDT 2014

You cannot even be "very confident" of not finding actual ill-formed
UTF-16, like unpaired surrogates, in an external file, let alone
noncharacters.

As for the noncharacters, take a look at the collation test files
that we distribute with each version of UCA. The test data includes
test strings like the following, to verify that UCA implementations
do the correct thing when faced with unusual edge cases:

FFFE 0021
FFFE 003F
FFFE 0061
FFFE 0041
FFFE 0062
1FFFE 0021
1FFFE 003F
1FFFE 0334
...

As well as test strings starting with unpaired surrogates:

D800 0021
D800 003F
D800 0061
D800 0041
D800 0062
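
For illustration, here is a minimal Python sketch (not part of the UCA
test suite) of how one of those hex lines maps to UTF-16 code units,
and how a reader could flag the two edge cases. The function names are
made up for this example, and it assumes each field is a valid code
point in hex, as in the lines above.

```python
def to_utf16_units(line):
    """Turn a line like '1FFFE 0021' into a list of UTF-16 code units."""
    units = []
    for field in line.split():
        cp = int(field, 16)
        if cp >= 0x10000:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            # BMP code point (including a lone surrogate): one code unit.
            units.append(cp)
    return units


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def has_unpaired_surrogate(units):
    """True if any surrogate code unit is not part of a well-formed pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            return True  # high surrogate with no low surrogate after it
        if 0xDC00 <= u <= 0xDFFF:
            return True  # low surrogate with no high surrogate before it
        i += 1
    return False
```

With these, "1FFFE 0021" becomes the units D83F DFFE 0021, which is
well-formed UTF-16 for a noncharacter, while "D800 0021" is ill-formed
because D800 is not followed by a low surrogate.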

And while it is true that the *file* CollationTest_SHIFTED.txt doesn't
start with either a noncharacter or an unpaired surrogate -- because
all of the test data in it is represented in ASCII hex strings instead of
directly in UTF-16 -- the issue in any case isn't whether a *file* starts
with a noncharacter, but whether a UTF-16 *string* starts with a
noncharacter. Any one of those test strings could be trivially turned
into a text file by piping out that one UTF-16 string to a file. And I
could then write conformant test software that would read UTF-16
string input data from that file and run it through the UCA algorithm
to construct sortkeys for it.
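
A rough sketch of that round trip in Python, under some assumptions:
the file is written as big-endian 16-bit code units with no BOM, the
file name and helper names are invented for this example, and the UCA
step itself is left out. Raw code units are packed with struct rather
than a text codec, so that an unpaired surrogate goes through as is.

```python
import os
import struct
import tempfile


def write_utf16be(path, units):
    """Write 16-bit code units to path as big-endian bytes, with no BOM."""
    with open(path, "wb") as f:
        f.write(struct.pack(">%dH" % len(units), *units))


def read_utf16be(path):
    """Read big-endian 16-bit code units back (assumes an even byte count)."""
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def roundtrip(units):
    """Write one string to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "one_string.txt")  # hypothetical file name
        write_utf16be(path, units)
        return read_utf16be(path)
```

Calling roundtrip([0xFFFE, 0x0021]) or roundtrip([0xD800, 0x0061])
hands back the same code units, which is all a test harness would
need before passing them on to a collation implementation.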

As Peter said, the main thing that prevents running into these is
that it isn't very *useful* to start off files (or strings) with U+FFFE. (In
the case of UTF-16 text data files, it would additionally be
confusing and possibly lead to misinterpretation of byte order,
if you were somehow depending solely on initial BOMs -- which
I wouldn't advise, anyway.)
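
The byte-order confusion can be shown with a small Python sketch. It
assumes the string <U+FFFE, U+0021> was written big-endian without a
BOM; the variable and function names are made up for this example.
Python's "utf-16" codec is used here as a stand-in for any reader that
relies solely on an initial BOM.

```python
# <U+FFFE, U+0021> serialized as big-endian code units, no BOM.
data = bytes([0xFF, 0xFE, 0x00, 0x21])


def decode_by_bom(raw):
    """Decode by sniffing an initial BOM, as the generic 'utf-16' codec does."""
    return raw.decode("utf-16")


def decode_as_be(raw):
    """Decode with the byte order known in advance to be big-endian."""
    return raw.decode("utf-16-be")


# The sniffing reader takes FF FE to be a little-endian BOM, drops it,
# and reads the remaining bytes 00 21 as the single code unit 0x2100.
sniffed = decode_by_bom(data)

# The reader that knows the byte order gets the two intended code points.
intended = decode_as_be(data)
```

Here sniffed is the one-character string U+2100, while intended is
U+FFFE followed by U+0021, so the two readers disagree about both the
length and the content of the same four bytes.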

Basically, the rules of standards (e.g., you shouldn't try to
publicly interchange noncharacters) are not like laws of
physics. Just because the standard says you shouldn't do
it doesn't mean it doesn't happen.

--Ken

> On Tue, 3 Jun 2014 21:28:05 +0000
> Peter Constable <petercon at microsoft.com> wrote:
> 
> > There's never been anything preventing a file from containing and
> > beginning with U+FFFE. It's just not a very useful thing to do, hence
> > not very likely.
> 
> Well, while U+FFFE was apparently prohibited from public interchange,
> one could be very confident of not finding it in an external file.  As
> an internally generated file, it would then be much more likely to be
> in the UTF-16BE or UTF-16LE encoding scheme.
> 
> Richard.