Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

Philippe Verdy verdy_p at wanadoo.fr
Thu Jun 5 18:23:34 CDT 2014


2014-06-05 21:46 GMT+02:00 Doug Ewell <doug at ewellic.org>:

> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > Not necessarily true.
> >
> > [602 words]
>
> This has nothing to do with the scenario I described, which involved
> removing a "BOM" from the start of an arbitrary fragment of data,
> thereby corrupting the data because the "BOM" was actually a ZWNBSP.
>
> If you have an arbitrary fragment of data, don't fiddle with it.
>

This is your scenario. The simple concept of a unique "start" of text does
not exist in live streams that can start anywhere. So you cannot always
expect that U+FEFF or U+FFFE will exist only once in a stream, and
necessarily at the position where you start reading it, because you may
already be past the initial creation of the stream without having any way
to come back to the "start".

Your scenario assumes that you can always "rewind" your file. This is not
always possible, and each user of that stream has its own start, different
from everyone else's. And this is not because they are "fiddling" with it.
Many applications internally use such one-way streams, which have no random
access capability, so they cannot be rewound to the "start".

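As a rough illustration of that point, here is a minimal Python sketch. The
producer() generator and the two consumers are hypothetical names invented
for this example; a generator is used only because, like the streams
described above, it can be read forward but never rewound.

```python
import itertools

def producer():
    """Hypothetical never-ending producer: emits one line per step."""
    for n in itertools.count(1):
        yield f"line {n}\n"

live = producer()

# Two lines are emitted before any consumer is connected; they are gone.
next(live)
next(live)

# Consumer A connects now: for A, the stream "starts" at line 3.
first_seen_by_a = next(live)

# Line 4 goes by, then consumer B connects: for B, the "start" is line 5.
next(live)
first_seen_by_b = next(live)

print(repr(first_seen_by_a), repr(first_seen_by_b))
```

Neither consumer can ask the generator for "line 1" again, and neither has
any way to know how much was emitted before it connected.
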
And the producer does not keep a complete log of everything that was
emitted. Clients just connect to the stream at the position the producer
has currently reached, which is already past the start seen by the
producer. In some cases there are even multiple producers contributing
independently to the stream. Debug log streams are typical examples, but
this could also be a live text stream of subtitles on a live TV or radio
channel: a single producer, and many consumers connecting to the
never-ending stream at any time. They have no possibility of rewinding
months or years back in time to get the full stream, which would mean
processing thousands of gigabytes of audio or video into which the live
text stream has been multiplexed.

Now you will argue: this live stream is not plain text, it has a binary
structure. Yes, but only if your consumer application wants to process the
full multiplex. Typically clients will demultiplex the stream and pass it
down to a simpler client that absolutely does not care about the transport
multiplex format. If that downstream client is just used to display the
incoming text, it will just wait for text, which will be buffered line by
line and displayed as soon as there's a newline separator. But even in
this case, each line may have been fragmented so that each fragment
contains a leading BOM, which will not necessarily be stripped, notably not
if the transport uses datagrams over a non-"reliable" protocol like UDP.
You have also incorrectly assumed that a text stream is necessarily
transported over a "reliable" protocol like TCP, where there can be no data
loss in the middle; i.e. you are still bound to classic storage on a file
system, even if this file system is named "HTTP" (and even in HTTP there
exist live streams without any defined start).
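
To make the fragment case concrete, here is a small Python sketch under
stated assumptions: make_fragment() and consume() are hypothetical helpers,
the byte order is assumed to be little-endian, and a plain list stands in
for the datagrams. It only shows that a consumer that joined late sees
U+FEFF several times, none of them at a global "start".

```python
import codecs

def make_fragment(text: str) -> bytes:
    """Hypothetical producer: prefixes every fragment with a UTF-16LE BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def consume(datagrams):
    """Hypothetical consumer: decodes each datagram on its own and leaves
    every U+FEFF in place rather than guessing which one is "the" BOM."""
    return [d.decode("utf-16-le") for d in datagrams]

stream = [
    make_fragment("first "),
    make_fragment("second "),
    make_fragment("third\n"),
]

# The consumer connected late (or the first datagram was lost).
joined = "".join(consume(stream[1:]))
feff_count = joined.count("\ufeff")

print(repr(joined), feff_count)
```

The "utf-16-le" codec is used on purpose: unlike Python's "utf-16" codec,
it does not consume a leading U+FEFF, so the consumer can see exactly what
arrived in each fragment.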

Texts are inherently fragmentable. Initially they are transcripts of human
communication, and nobody in real life is permanently connected to someone
else and able to remember everything that was said by someone else.
Fragmented texts are natural and have always existed, even before they were
written on a material support. On a digital network, text is dematerialized
again and is materialized only by consumers; you don't transmit the
bounded support. The concept of a "start" of text is in fact very
artificial; this is not the way people interact with each other or in
groups.