UTF-16 Encoding Scheme and U+FFFE

Wed Jun 4 01:54:23 CDT 2014

U+FFFE is prohibited in interchanges because, if an interchange specifies
the UTF-16 encoding scheme (not UTF-16BE or UTF-16LE), it would be
interpreted as a BOM where it occurs at the start of a stream (with the
consequence of reparsing it as U+FEFF with bytes swapped). In all other
positions it cannot be a BOM.
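
As a rough illustration of that reparsing step, here is a minimal Python
sketch. The function name detect_utf16_order and the default assumed order
are hypothetical choices made for this example, not anything defined by the
standard: the reader decodes the first 16-bit unit in the order it assumes,
and if it sees U+FFFE it concludes that the opposite order applies.

```python
def detect_utf16_order(data: bytes, assumed: str = "big") -> tuple:
    """Return (byte_order, offset) for a stream labelled plain UTF-16.

    `assumed` is the order the reader tries first (hypothetical default).
    `offset` is where the text starts once a leading BOM is skipped.
    """
    if len(data) < 2:
        return assumed, 0
    unit = int.from_bytes(data[:2], assumed)
    if unit == 0xFEFF:
        # BOM read correctly: the assumed order is the right one.
        return assumed, 2
    if unit == 0xFFFE:
        # Byte-swapped BOM: the stream uses the opposite order.
        return ("little" if assumed == "big" else "big"), 2
    # No BOM at all: keep the assumed order and consume nothing.
    return assumed, 0
```

For example, detect_utf16_order(b"\xff\xfeA\x00") reports little-endian
order with the text starting at offset 2.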

BOMs are *normally* only authorized in interchanges at the "start" of
streams.

But this is a problem for "live" streams that have no defined "start" but
can be synced at random positions, such as on the next newline or at the
start of a network datagram. Note that some network layers may fragment
datagrams, so that a BOM could be repeated, and also reunite them, leaving
multiple BOMs in the same datagram. So we can assume that U+FFFE anywhere
in a UTF-16 "live" stream (not a UTF-16BE or UTF-16LE stream) is each time
a BOM, and not a legacy ZWNBSP or a non-character.
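
A minimal Python sketch of such a tolerant reader, under the assumption
made above that every U+FEFF or U+FFFE unit in a plain UTF-16 live stream
is a BOM. The name decode_live_utf16 is hypothetical, and a real reader
would also have to carry a trailing odd byte over to the next chunk:

```python
def decode_live_utf16(data: bytes, order: str = "big") -> str:
    """Decode UTF-16 bytes, treating every U+FEFF / U+FFFE unit as a BOM."""
    units = bytearray()  # text code units, collected in big-endian order
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], order)
        if unit == 0xFEFF:
            # BOM confirming the current byte order: drop it.
            continue
        if unit == 0xFFFE:
            # Byte-swapped BOM: switch byte order from here on, drop it.
            order = "little" if order == "big" else "big"
            continue
        units += unit.to_bytes(2, "big")
    # Surrogate pairs are reassembled here; lone surrogates become U+FFFD.
    return units.decode("utf-16-be", errors="replace")
```

With this sketch, a little-endian fragment followed by a big-endian
fragment, each starting with its own BOM, decodes to the concatenated text
with no BOM left in it.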

Streams that are known to be UTF-16BE or UTF-16LE are also not recommended
for interchange if these files or live streams may be transmitted without
metadata specifying their encoding explicitly (many remote readers will
interpret them instead as UTF-16, possibly with multiple BOMs in
resynchronizable live streams).

The problem of live streams is also a good reason why ZWNBSP (U+FEFF) has
been strongly discouraged in interchanges in favor of WORD JOINER (U+2060).
This also applies to all other conforming UTFs (including UTF-8, UTF-16BE,
UTF-16LE, UTF-32, UTF-32LE, UTF-32BE), where it is strongly recommended not
to use U+FEFF and U+FFFE except for BOMs (possibly repeated on live
streams).

You should note that conforming processes working in interchanges (or
storage) should always be allowed to switch from one standard UTF to
another. And the same encoded streams may be consumed by various clients
having different native byte orders. It has now become difficult to define
what a "local" system is, when applications are converted to work in a
cloud with more and more heterogeneous clients and more intermediate third
parties (providing things like caching, archiving, proxying, backup of data
and restoration on another system...).

For long term reusability of data, we are strongly encouraged not to use
U+FFFE and U+FEFF except for BOMs, and we should be tolerant about the
number of BOMs found. In my opinion, UCA implementations should discard
them on input, treating them as fully ignorable, except that they delimit
combining sequences for the purpose of normalisation, which conforming
applications or intermediate filters should be allowed to perform as they
want. And we should absolutely forget the legacy semantics of ZWNBSP.

But this complexity and tolerance for one or more BOMs also means that all
UTFs not based on 8-bit code units should also be discouraged in
interchanges. This means that UTF-16 and UTF-32 should be discouraged,
leaving UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE not for storage or
networking, but only for temporary in-memory streams used inside the
"blackbox" that internally implements each conforming process. For all the
rest, most applications now use UTF-8, possibly packaged within a generic
compressed stream. Binary compression of live streams remains possible,
even if you cannot predict in the text encoding where the resynchronization
points will occur: it's up to the protocol using this transport compression
to properly define the resynchronization points.

In UTF-8 streams we can completely omit U+FFFE and U+FEFF, whether as BOMs,
ZWNBSP or non-characters (and we can also expect that many applications
will just discard them silently, as they only have a "no-op" role as BOMs
in 8-bit streams). If an application outputs an 8-bit stream that is not
UTF-8, it will drop all U+FEFF and U+FFFE found in input, and will often
output the encoding of U+FEFF in the non-UTF-8 encoding it generates,
frequently as a "magic" signature of this encoding. Secure digital
signatures of text streams should also ignore these code points silently,
as they won't be relevant elsewhere in the chain of producers or consumers
of this data. These secure digital signatures should be computed by
dropping the discardable U+FEFF and U+FFFE, normalizing the data (for
example to NFC or NFD), and producing a specific UTF; the easiest one to
avoid complications is UTF-32BE or UTF-32LE with a predetermined byte
order, as specified by the digital signature algorithm.
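
The canonicalization steps just described could be sketched in Python as
follows. This is only an illustration under stated assumptions: the name
text_digest is hypothetical, NFC and UTF-32BE are picked arbitrarily from
the options mentioned above, and SHA-256 is a plain digest standing in for
whatever a real signature algorithm would specify.

```python
import hashlib
import unicodedata

def text_digest(text: str) -> str:
    """Digest of text after dropping U+FEFF/U+FFFE and normalizing to NFC."""
    # 1. Drop the discardable code points wherever they occur.
    stripped = text.replace("\ufeff", "").replace("\ufffe", "")
    # 2. Normalize (NFC chosen for this sketch; NFD would work equally well
    #    as long as signer and verifier agree).
    normalized = unicodedata.normalize("NFC", stripped)
    # 3. Serialize in one UTF with a predetermined byte order, then hash.
    return hashlib.sha256(normalized.encode("utf-32-be")).hexdigest()
```

Two streams that differ only in BOMs or in normalization form, such as
"\ufeffcafe\u0301" and "caf\u00e9", then produce the same digest.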

Additionally it will be very easy to use as many U+FEFF code units as
needed as ignorable extra BOMs, for cases where a protocol needs a safe
"padding filler" because it wants to use fixed-size block I/O with random
access and easy resynchronization (in live streams), when the producer
safely breaks data blocks at boundaries of combining sequences (allowing
these blocks to be normalized separately and reunited later without
creating problems).
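
A rough Python sketch of that padding idea, with several simplifying
assumptions: block sizes are counted in code points (which only gives a
fixed byte size in a fixed-width form such as UTF-32BE), a "combining
sequence boundary" is approximated as a position whose next character has
canonical combining class 0, and the payload itself is assumed to contain
no U+FEFF. The names split_into_blocks and join_blocks are hypothetical.

```python
import unicodedata

PAD = "\ufeff"  # ignorable filler, dropped again by the consumer

def split_into_blocks(text: str, size: int) -> list:
    """Cut text into blocks of exactly `size` code points, padded with PAD."""
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        # Back up while the next block would start with a combining mark.
        while end < len(text) and end > start and unicodedata.combining(text[end]):
            end -= 1
        if end == start:
            raise ValueError("combining sequence longer than block size")
        blocks.append(text[start:end].ljust(size, PAD))
        start = end
    return blocks

def join_blocks(blocks: list) -> str:
    """Reunite blocks and discard the padding."""
    return "".join(blocks).replace(PAD, "")
```

Splitting "abe\u0301cd" into blocks of 3 code points keeps "e\u0301"
together in one block, pads the short blocks with U+FEFF, and join_blocks
returns the original text. Sequences of starters that can still compose
with each other (Hangul jamo, for instance) are not handled by this
approximation.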

2014-06-04 1:50 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Tue, 3 Jun 2014 21:28:05 +0000
> Peter Constable <petercon at microsoft.com> wrote:
>
> > There's never been anything preventing a file from containing and
> > beginning with U+FFFE. It's just not a very useful thing to do, hence
> > not very likely.
>
> Well, while U+FFFE was apparently prohibited from public interchange,
> one could be very confident of not finding it in an external file.  As
> an internally generated file, it would then be much more likely to be
> in the UTF-16BE or UTF-16LE encoding scheme.
>
> Richard.