Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Doug Ewell via Unicode unicode at unicode.org
Mon Jul 24 10:50:24 CDT 2017


Costello, Roger L. wrote:

> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence. 
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence? 

I can think of two ways this could go wrong:

1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.

2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.
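
The two failure modes above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function
names (fold, unfold, unfold_buggy, sanitize) are hypothetical, the fold
marker is assumed to be CRLF plus a single space, and the input is
assumed not to contain that marker already.

```python
CRLF_SP = b"\r\n "  # assumed fold marker: CRLF plus one space


def fold(data: bytes, width: int) -> bytes:
    """Insert the fold marker after every `width` octets, ignoring character boundaries."""
    chunks = [data[i:i + width] for i in range(0, len(data), width)]
    return CRLF_SP.join(chunks)


def unfold(folded: bytes) -> bytes:
    """Correct unfolding: delete the CRLF and the white space that follows it."""
    return folded.replace(CRLF_SP, b"")


def unfold_buggy(folded: bytes) -> bytes:
    """Failure mode 1: deletes the CRLF but leaves the inserted space behind."""
    return folded.replace(b"\r\n", b"")


def sanitize(folded: bytes) -> bytes:
    """Failure mode 2: an intermediate process decodes each line on its own,
    turning partial sequences into U+FFFD, and re-encodes the result."""
    lines = folded.split(b"\r\n")
    return b"\r\n".join(
        line.decode("utf-8", errors="replace").encode("utf-8") for line in lines
    )


original = "naïve café".encode("utf-8")
folded = fold(original, 3)  # the first cut lands inside the two-octet "ï"

assert unfold(folded) == original               # byte-exact round trip
assert unfold_buggy(folded) != original         # stray space splits "ï"
assert unfold(sanitize(folded)) != original     # "ï" replaced by U+FFFD
```

In both failing cases the damage is done by the process around the
split, not by the split itself.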

In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
ambiguity.
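
As a quick check of that claim, here is a minimal Python sketch that
cuts an encoded string at every possible octet offset and puts it back
together. The helper name roundtrips is hypothetical. It checks both
plain concatenation and Python's standard incremental decoder, which
buffers an incomplete sequence until the rest arrives.

```python
import codecs


def roundtrips(text: str) -> bool:
    """Return True if every two-way split of the UTF-8 form restores `text`."""
    data = text.encode("utf-8")
    for cut in range(len(data) + 1):
        head, tail = data[:cut], data[cut:]
        # Concatenating the pieces restores the exact octets.
        if head + tail != data:
            return False
        # A streaming decoder holds back a partial sequence instead of
        # guessing, so decoding the pieces in order also restores the text.
        decoder = codecs.getincrementaldecoder("utf-8")()
        restored = decoder.decode(head) + decoder.decode(tail, final=True)
        if restored != text:
            return False
    return True


assert roundtrips("naïve café €")
```

The receiver only has to join the octets before (or while) decoding; it
never needs to guess where the cut was.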

A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
begin with.
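
For UTF-8 specifically, avoiding a split inside a sequence is cheap,
because continuation octets always have the form 10xxxxxx. The sketch
below is one way to do it, assuming the input is valid UTF-8; the names
safe_split_point and split_chunks are hypothetical.

```python
def safe_split_point(data: bytes, limit: int) -> int:
    """Return the largest index <= limit that is not inside a UTF-8 sequence."""
    pos = min(limit, len(data))
    # Back up while the octet at the cut is a continuation octet (10xxxxxx).
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def split_chunks(data: bytes, limit: int) -> list[bytes]:
    """Split into chunks of at most `limit` octets on character boundaries."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    chunks = []
    start = 0
    while start < len(data):
        end = safe_split_point(data, start + limit)
        if end <= start:
            # `limit` is smaller than one character: fall back to a raw cut
            # so the loop still makes progress.
            end = min(start + limit, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


data = "naïve café".encode("utf-8")
chunks = split_chunks(data, 3)

assert b"".join(chunks) == data
for chunk in chunks:
    chunk.decode("utf-8")  # each chunk is valid UTF-8 on its own
```

With a limit of at least four octets the fallback branch is never taken
for valid input, since no UTF-8 sequence is longer than four octets.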
 
--
Doug Ewell | Thornton, CO, US | ewellic.org



