Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Doug Ewell via Unicode
unicode at unicode.org
Mon Jul 24 10:50:24 CDT 2017
Costello, Roger L. wrote:
> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence.
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence?
1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.
2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.
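
To make the two failure modes concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample string, the variable names, and the fold convention (insert CRLF plus exactly one space, delete both on unfolding) are chosen for the example and are not taken from any particular protocol or implementation.

```python
# Illustrative sample: "é" (U+00E9) is the two-octet sequence C3 A9.
original = "café".encode("utf-8")            # b'caf\xc3\xa9'

# Split in the middle of the two-octet sequence.
head, tail = original[:4], original[4:]      # b'caf\xc3' and b'\xa9'

# Plain octet-level reassembly restores the sequence exactly.
restored = head + tail

# Assumed fold convention: CRLF plus one space between the pieces.
folded = head + b"\r\n " + tail

# Correct unfolding deletes the CRLF *and* the inserted space.
good_unfold = folded.replace(b"\r\n ", b"")

# Failure 1 (bug): the unfolder deletes only the CRLF, so a stray
# space is left between the lead octet and its continuation octet.
bad_unfold = folded.replace(b"\r\n", b"")
try:
    bad_unfold.decode("utf-8")
    bad_unfold_is_valid = True
except UnicodeDecodeError:
    bad_unfold_is_valid = False

# Failure 2 (non-conformant): an intermediate process decodes each
# fragment on its own and substitutes U+FFFD for the partial sequence.
damaged = (head.decode("utf-8", errors="replace")
           + tail.decode("utf-8", errors="replace"))
```

In this sketch `restored` and `good_unfold` are octet-for-octet identical to `original`, while `bad_unfold` is no longer valid UTF-8 and `damaged` has lost the "é" entirely.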
In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
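
One way an encoding-aware splitter might avoid the problem is sketched below. The function name `safe_split_index` and its `limit` parameter are hypothetical, and the sketch assumes the input is already well-formed UTF-8. It relies on the fact that every UTF-8 continuation octet has the bit pattern 10xxxxxx, so the splitter can back up from the desired split point until it is no longer about to cut in front of a continuation octet.

```python
def safe_split_index(data: bytes, limit: int) -> int:
    """Return the largest index <= limit that does not fall inside a
    multi-octet UTF-8 sequence (hypothetical helper; assumes that
    `data` is well-formed UTF-8)."""
    i = min(limit, len(data))
    # If data[i] is a continuation octet (10xxxxxx), splitting at i
    # would separate it from its lead octet, so step back.
    while 0 < i < len(data) and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "café!".encode("utf-8")               # b'caf\xc3\xa9!'
cut = safe_split_index(data, 4)              # 4 would land between C3 and A9
head, tail = data[:cut], data[cut:]
```

Here both `head` and `tail` decode on their own, so an intermediate process that inspects the pieces sees only complete sequences. Note that this keeps code points intact, not necessarily user-perceived characters such as a base letter followed by a combining mark.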
Doug Ewell | Thornton, CO, US | ewellic.org