UTF-16 Encoding Scheme and U+FFFE

Richard Wordingham richard.wordingham at ntlworld.com
Wed Jun 4 14:01:48 CDT 2014

On Wed, 4 Jun 2014 00:23:53 +0000
"Whistler, Ken" <ken.whistler at sap.com> wrote:

> You cannot even be "very confident" of not finding actual ill-formed
> UTF-16, like unpaired surrogates, in an external file, let alone
> noncharacters.

I thought unpaired surrogates were normally mojibake, broken
characters, or sabotage attempts.
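
To make "unpaired" concrete, here is a minimal Python sketch that scans a
sequence of UTF-16 code units for surrogates lacking a partner.  The
function name find_unpaired_surrogates is hypothetical, not taken from any
library, and the input is assumed to be a list of integers in the range
0x0000..0xFFFF.

```python
def find_unpaired_surrogates(units):
    """Return the indices of surrogate code units that have no partner."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)  # lead surrogate with no trail after it
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            bad.append(i)
        i += 1
    return bad


# <U+0041, U+1F600, U+0042> is well-formed; a lone 0xD800 is not.
print(find_unpaired_surrogates([0x0041, 0xD83D, 0xDE00, 0x0042]))
print(find_unpaired_surrogates([0xD800, 0x0041]))
```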

> Any one of those test strings could be
> trivially turned into a text file by piping out that one UTF-16
> string to a file.

At that point, you should be in detailed control of the Unicode encoding
scheme.  Also, would not the system be using one of UTF-16 with a byte
order mark, UTF-16BE, or UTF-16LE?
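
As a rough illustration of the three encoding schemes, the sketch below
uses Python's standard codecs.  The sample string is an arbitrary choice
for the example.  Python's "utf-16" codec writes a BOM followed by the
platform's native byte order, so that output is not the same on every
machine.

```python
import codecs

text = "A\U0001F600"  # one BMP character and one supplementary character

be = text.encode("utf-16-be")   # UTF-16BE: big-endian, no BOM
le = text.encode("utf-16-le")   # UTF-16LE: little-endian, no BOM
bom = text.encode("utf-16")     # UTF-16: BOM first, then native byte order

print(be.hex(" "))   # 00 41 d8 3d de 00
print(le.hex(" "))   # 41 00 3d d8 00 de
print(bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# With the scheme stated out of band, each form decodes to the same string.
print(be.decode("utf-16-be") == le.decode("utf-16-le") == bom.decode("utf-16"))
```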

> And I could then write conformant test software
> that would read UTF-16 string input data from that file and run it
> through the UCA algorithm to construct sortkeys for it.

Given the number of control characters in that file, I wouldn't be
confident of getting the output back the same as it went out unless the
input were controlled at a binary level.

> As Peter said, the main thing that prevents running into these is
> that it isn't very *useful* to start off files (or strings) with

Actually, for sorting records using the CLDR collation algorithm, it
may be very useful to use U+FFFE as a field separator.  If the most
significant field for sorting is sometimes empty (e.g. surname in a list
of contacts), then the field separator could very easily be the first
non-BOM character after sorting.  I suppose one had better use
something like <COMMA, U+FFFE> as a field separator instead.

> (And, additionally, in the case of UTF-16 text data files, it
> would be confusing and possibly lead to misinterpretation of byte
> order, if you were somehow depending solely on initial BOMs -- which
> I wouldn't advise, anyway.)

Interesting.  Goodbye UTF-16 encoding scheme and hello automatic
encoding detection.  I'm not sure how automatic detection is supposed
to work with a file consisting of just a test string from the
collation test. 
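
A small Python sketch of the hazard, under stated assumptions: the data is
written as UTF-16LE without a BOM, the text begins with U+FFFE (for
example a record whose first field is empty), and the reader guesses byte
order from the first two bytes alone.  The helper sniff_utf16_bom is
hypothetical and deliberately naive.

```python
import codecs


def sniff_utf16_bom(data):
    """Guess the byte order from the first two bytes only (naive)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None


# A record with an empty first field, so it begins with the separator.
record = "\uFFFEJohn"
raw = record.encode("utf-16-le")    # UTF-16LE writes no BOM

guess = sniff_utf16_bom(raw)        # U+FFFE in LE is FE FF, like a BE BOM
misread = raw[2:].decode(guess)     # "BOM" dropped, remainder byte-swapped

print(guess)
print([f"U+{ord(c):04X}" for c in misread])
```

Decoding the same bytes with the scheme known in advance,
raw.decode("utf-16-le"), returns the original record including U+FFFE.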

> Basically, the rules of standards (e.g., you shouldn't try to
> publicly interchange noncharacters) are not like laws of
> physics. Just because the standard says you shouldn't do
> it doesn't mean it doesn't happen.

Just as theft happens.

