Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)
asmusf at ix.netcom.com
Wed Jun 4 13:40:11 CDT 2014
On 6/4/2014 11:26 AM, Doug Ewell wrote:
> Sorry, I left out an important detail.
> I wrote:
>> 3. U+FEFF at the beginning of a stream (note: not "packet" or
>> arbitrary cutoff point)
> I meant U+FEFF as a zero-width no-break space. Obviously it is very
> common to see U+FEFF as a signature or BOM.
> My underlying question here is, how common is it that the producer of a
> stream actually intends this character *at the start of a stream* to be
> a ZWNBSP, not to be stripped lest the actual text content be altered?
Its semantics were chosen at the time so that it would make no sense at
the start of text and would be invisible in most situations. What
remained of those semantics was later taken over by WORD JOINER (U+2060),
so there is now NO use for U+FEFF as part of text.
The use as part of a convention has always been clear. If you stick this
at the front, readers will byte-reverse your data; that should weed out
accidental use pretty quickly :) Or prevent people from getting "cute"
with it in other ways.
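
A minimal sketch of that convention in Python, for illustration only. The
helper name decode_utf16_stream is hypothetical, and the big-endian
fallback for unsigned data is an assumption (it matches the default for the
UTF-16 encoding scheme when no higher-level protocol says otherwise). The
reader looks only at the first two bytes, picks a byte order from them, and
drops them from the text:

```python
import codecs


def decode_utf16_stream(data: bytes) -> str:
    """Hypothetical helper: decode a complete UTF-16 byte stream.

    A leading U+FEFF is always treated as a signature (BOM) and stripped,
    never kept as ZERO WIDTH NO-BREAK SPACE.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    # Assumption: no signature means big-endian.
    return data.decode("utf-16-be")


# A producer that "meant" a leading ZWNBSP loses it on the reader side.
stripped = decode_utf16_stream("\ufeffhi".encode("utf-16-be"))

# A big-endian producer that starts with U+FFFE (bytes FF FE) followed by
# "A" (bytes 00 41) gets the rest of its data read in the opposite order.
reversed_text = decode_utf16_stream(b"\xff\xfe\x00A")

# U+FEFF anywhere except the very start is left alone.
interior = decode_utf16_stream("a\ufeffb".encode("utf-16-be"))
```

Here `stripped` is just "hi", `reversed_text` is the single character
U+4100 rather than "A", and `interior` still contains its U+FEFF.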
So, I would think that for this particular code point, you can safely
assume that it's buggy or test data.
Buggy data you just byte reverse as requested and let the user take the
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell