Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

Richard Wordingham richard.wordingham at ntlworld.com
Wed Jun 4 14:21:03 CDT 2014


On Wed, 04 Jun 2014 11:40:11 -0700
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 6/4/2014 11:26 AM, Doug Ewell wrote:

> > I meant U+FEFF as a zero-width no-break space. Obviously it is very
> > common to see U+FEFF as a signature or BOM.

> The semantics of it were chosen at the time to make no sense
> at the start, and to make the character invisible in most situations.
> The remnant of its semantic was later taken up by Word Joiner, so that
> there is now NO use for this as part of text.
 
> The use as part of a convention has always been clear. If you stick
> this at the front, readers will byte-reverse your data; that should
> weed out accidental use pretty quickly :) Or prevent people from
> getting "cute" with it in other ways.

Wrong!  If you stick U+FEFF at the start of a file, expect it to be
stripped.  If you stick U+FFFE at the start of a file, then expect the
rest of the text to be byte-reversed.
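
As a rough illustration of both cases, here is a minimal Python sketch.
It assumes a reader that sniffs the byte order mark the way Python's
built-in "utf-16" codec does; other readers may behave differently, and
the helper name read_utf16 is made up for this example.

```python
def read_utf16(data: bytes) -> str:
    """Decode with Python's BOM-sensing "utf-16" codec (one example of a reader)."""
    return data.decode("utf-16")


# U+FEFF first: serialised big-endian as FE FF, which the reader takes for a BOM.
with_feff = "\ufeffAB".encode("utf-16-be")

# U+FFFE first: serialised big-endian as FF FE, which looks like a
# little-endian BOM to the reader.
with_fffe = "\ufffeAB".encode("utf-16-be")

stripped = read_utf16(with_feff)       # leading U+FEFF is consumed
reversed_text = read_utf16(with_fffe)  # remaining code units are read byte-swapped

print(repr(stripped))
print([f"U+{ord(c):04X}" for c in reversed_text])
```

With this reader, the first string comes back as "AB" with the U+FEFF
gone, while the second comes back as U+4100 U+4200, i.e. the bytes of
"AB" read in the wrong order.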

> So, I would think that for this particular code point, you can safely 
> assume that it's buggy or test data.

The example that's usually given is that of a text file sliced into
segments to avoid file size limits.  In these cases, there is the risk
that U+FEFF as ZWNBSP will wind up at the start of a segment and be
stripped.  In the Windows command window, the solution is to perform a
*binary* concatenation of the segments; otherwise newlines will be
inserted between the segments, which is much more severe damage.
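
The stripping risk can be sketched in Python, again assuming a reader
that behaves like Python's BOM-sensing "utf-16" codec.  The sample text
and the cut position are invented for the illustration; they are chosen
so that a ZWNBSP lands at the very start of the second segment.

```python
import codecs

# A little-endian UTF-16 file: BOM, then "foo", U+FEFF as ZWNBSP, "bar".
data = codecs.BOM_UTF16_LE + "foo\ufeffbar".encode("utf-16-le")

# Hypothetical size limit: cut after the BOM (2 bytes) and "foo" (6 bytes),
# so the second segment begins with the bytes FF FE.
cut = 8
segments = [data[:cut], data[cut:]]

# Decoding each segment on its own: the ZWNBSP is mistaken for a BOM and lost.
per_segment = "".join(seg.decode("utf-16") for seg in segments)

# Binary concatenation first, then a single decode: the ZWNBSP survives.
binary_join = b"".join(segments).decode("utf-16")

print(repr(per_segment))
print(repr(binary_join))
```

Only the file's real BOM is consumed in the second approach, because the
decoder sees the start of the data exactly once.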

Richard.

