Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

Richard Wordingham richard.wordingham at ntlworld.com
Thu Jun 5 12:40:09 CDT 2014


On Thu, 5 Jun 2014 09:41:07 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> You'll probably want to sync on the first newline control and then
> proceed from that point. But now if you have those devices configured
> heterogeneously and generating their own output encoding you won't
> necessarily know how it is encoded even if all of them use some UTF of
> Unicode. So the stream will regularly repost an encoding mark, for
> example at the beginning of each dated logged entry, and this could be
> just an encoded BOM (even with UTF-8, or some other UTF like UTF-16,
> which would be more likely if the text was essentially in an
> East Asian (CJK) language).

Of course, this is not an arbitrary fragment.  In this location, ZWNBSP
will have almost no effect.  (The only ways I can think of for it to
matter are character counts and the text being pasted immediately after
another word.)  This, and the early belief that U+FFFE would not occur
in Unicode text, are why U+FEFF was chosen as the mark.
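
As a rough illustration of the per-entry mark in the quoted scenario,
here is a minimal Python sketch.  The function name decode_entry, the
UTF-8 fallback and the restriction to UTF-8 and UTF-16 are assumptions
made for the example, not anything specified in this thread; UTF-32 is
left out, and its little-endian BOM begins with the same two bytes as
the UTF-16LE one, so a real detector would need to test for it first.

```python
import codecs
from typing import Tuple

# BOM byte sequences for the encodings this sketch handles.
# Assumption: only UTF-8 and UTF-16 appear in the stream.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def decode_entry(entry: bytes, default: str = "utf-8") -> Tuple[str, str]:
    """Return (encoding, text) for one log entry, using a leading BOM if present.

    The BOM itself is dropped from the returned text. Without a BOM the
    entry is decoded with `default`, which is a hypothetical fallback.
    """
    for bom, encoding in BOMS:
        if entry.startswith(bom):
            return encoding, entry[len(bom):].decode(encoding)
    return default, entry.decode(default)


# The bytes FF FE read in the wrong byte order give U+FFFE, not U+FEFF.
swapped = b"\xff\xfe".decode("utf-16-be")

# A U+FEFF left in the text is invisible but still counts as a character.
length_with_mark = len("\ufeffword")
```

Here swapped is the single code point U+FFFE, which is what lets a
reader detect the wrong byte order, and length_with_mark is 5 rather
than 4, which is the character-count effect mentioned above.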

Richard.

