Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

Philippe Verdy verdy_p at wanadoo.fr
Thu Jun 5 02:41:07 CDT 2014


2014-06-05 0:48 GMT+02:00 Doug Ewell <doug at ewellic.org>:

> If you are processing arbitrary fragments of a stream, without knowledge
> of preceding fragments, as in this example, then you have no business
> making *any* changes to that fragment based on interpretation of that
> fragment as Unicode text. Your sole responsibilities at that point are
> to pass the fragments, intact, from one process to the next, or to
> disassemble and reassemble them.


Not necessarily true.

Consider a debugging log coming from an OS or device, accumulating text
data from various sources in the device. You can connect to such a live
stream at any time without having followed everything that happened
before.
You'll probably want to sync on the first newline control and then proceed
from that point. But if those devices are configured heterogeneously and
each generates its own output encoding, you won't necessarily know how the
text is encoded, even if all of them use some UTF of Unicode. So the stream
will regularly repost an encoding mark, for example at the beginning of
each dated log entry, and this could be just an encoded BOM (even with
UTF-8, or with some other UTF such as UTF-16, which would be more likely if
the text were essentially in an East Asian (CJK) language).
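
As a rough illustration of the consumer side, here is a Python sketch that
checks each log entry for a leading BOM and switches decoders accordingly.
It assumes the stream has already been cut into entries at the
resynchronization points; the names BOMS and decode_entry are hypothetical,
and the list of encodings is only a guess at what such devices might emit.

```python
import codecs
from typing import Optional, Tuple

# BOM byte sequences mapped to codec names. UTF-32 is tested before UTF-16
# because the UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
# A UTF-16LE entry whose first character after the BOM is U+0000 would be
# misread as UTF-32LE; this sketch accepts that ambiguity.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_entry(entry: bytes, current: Optional[str] = None) -> Tuple[str, str]:
    """Decode one log entry and return (text, encoding in effect).

    If the entry starts with a BOM, the BOM selects the encoding and is
    stripped. Otherwise the encoding of the previous entry (`current`) is
    reused. With neither, the entry cannot be interpreted safely.
    """
    for bom, name in BOMS:
        if entry.startswith(bom):
            return entry[len(bom):].decode(name, errors="replace"), name
    if current is None:
        raise ValueError("no BOM seen yet; encoding of this entry is unknown")
    return entry.decode(current, errors="replace"), current
```

A client that joins mid-stream would discard entries until decode_entry
stops raising, then pass the returned encoding back in as `current` for the
following entries.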

These devices would emit these messages or logs with a very basic protocol,
or no protocol at all (Telnet, serial link, ...), without any prior
negotiation. These data feeds are unidirectional and meant to be used by
any number of consumers that can connect to or disconnect from them at any
time; the log producer will never know how many clients there are, notably
for passive debugging logs.

You could then expect BOMs to occur many times in the stream. This is what
I called a "live" stream: it has no start, no end, and no defined total
size; you don't know when new texts will be emitted, and you don't even
know at which rate, which could be very high. If the rate is too high, one
can use a fast local proxy to filter the feed with patterns (e.g. a debug
level reported at the start of each log entry, or some identifier of the
real source, which is not controlled directly at the point where you
connect to listen to the stream) and receive only the result that can be
supported over a slower link to the client. But here too the proxy will not
necessarily work continuously, only when some interested client has
provided a pattern to match. The resulting texts will then be highly
fragmented.
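
A filtering proxy of this kind could be sketched in Python as follows. The
log format (a bracketed debug level at the start of each entry) and the
name filter_feed are assumptions made for illustration only. Because the
output is fragmented, this sketch re-emits every matching entry as UTF-8
with its own BOM, so that each fragment can be decoded on its own; that is
one possible design, not the only one.

```python
import re
from typing import Iterable, Iterator


def filter_feed(entries: Iterable[str], pattern: str) -> Iterator[bytes]:
    """Yield only the entries matching `pattern`, each encoded with a BOM.

    `entries` are already-decoded log entries. The "utf-8-sig" codec
    prepends the UTF-8 BOM (EF BB BF) on every encode call, so each
    forwarded fragment carries its own encoding mark.
    """
    matcher = re.compile(pattern)
    for entry in entries:
        if matcher.search(entry):
            yield entry.encode("utf-8-sig")
```

For example, filter_feed(feed, r"^\[ERROR\]") would forward only the
entries whose (hypothetical) level prefix is [ERROR].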

So your assumption is only true for processes that have a prior agreement
to use some specific convention. But in a heterogeneous world where
participants (producers and consumers) are maintained separately and can
appear or disappear at any time, you cannot expect that they will all use
the same encoding, or that disassembling/reassembling is as safe as you
think. This is only true if they work in close cooperation under strict
common standards.

Take the example of a service that would archive all received emails in a
feed, or a list of SMS messages from a group of participants: do you need
to archive not only the texts themselves but also all the protocol metadata
from which they originated, when the application is creating a basic log
which will not itself be carried by SMS or email due to the generated
volume?

Encoded texts in heterogeneous environments and over the web, where people
could use various OSes and languages, are well-known examples where plain
text is not always sufficient to determine how to decode it: you cannot
just "guess" from the content when this content can change at any time. And
these texts are not always safely convertible to the same encoding without
data losses or alterations. If you don't insert enough BOMs in the live
stream after some resynchronization points, the result that consumers will
get will be full of mojibake.
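
To make the failure mode concrete, here is a small Python sketch. The
sample entry and the choice of Windows-1252 as the wrongly guessed encoding
are assumptions for illustration; any mismatched guess has a similar
effect.

```python
import codecs

entry = "température: 25°C"
raw = entry.encode("utf-8")

# A consumer that joins mid-stream, sees no encoding mark, and guesses a
# legacy 8-bit encoding gets each two-byte UTF-8 sequence as two characters.
guessed = raw.decode("cp1252")

# The same bytes prefixed with a UTF-8 BOM: a consumer that honours the
# mark strips it and recovers the original text.
marked = codecs.BOM_UTF8 + raw
recovered = marked.decode("utf-8-sig")

print(guessed)
print(recovered)
```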