Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Doug Ewell via Unicode
unicode at unicode.org
Mon Jul 24 10:50:24 CDT 2017
Costello, Roger L. wrote:
> Suppose an application splits a UTF-8 multi-octet sequence. The
> application then sends the split sequence to a client. The client must
> restore the original sequence.
>
> Question: is it possible to split a UTF-8 multi-octet sequence in such
> a way that the client cannot unambiguously restore the original
> sequence?
1. (Bug) The folding process inserts CRLF plus white space characters,
and the unfolding process doesn't properly delete all of them.
2. (Non-conformant behavior) Some process, after folding and before
unfolding, attempts to interpret the partial UTF-8 sequences and
converts them into replacement characters or worse.
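
To make the two failure modes above concrete, here is a minimal Python sketch. It assumes a folding step that inserts CRLF plus a single space in the middle of the two-octet sequence for U+00E9; the variable names are purely illustrative and do not come from any specification.

```python
# Illustrative only: split the two-octet UTF-8 sequence for U+00E9 (C3 A9).
octets = "é".encode("utf-8")
first, second = octets[:1], octets[1:]

# Assumed folding step: CRLF plus one space inserted between the two octets.
folded = first + b"\r\n " + second

# Failure mode 1: a buggy unfolder deletes the CRLF but leaves the space behind,
# so the lead octet is no longer followed by its continuation octet.
buggy = folded.replace(b"\r\n", b"")
try:
    buggy.decode("utf-8")
    buggy_decodes = True
except UnicodeDecodeError:
    buggy_decodes = False

# Failure mode 2: an intermediate process decodes each fragment on its own,
# turning both partial sequences into U+FFFD replacement characters.
damaged = first.decode("utf-8", errors="replace") + second.decode("utf-8", errors="replace")

# Correct handling: treat the fragments as opaque octets and decode only after joining.
restored = folded.replace(b"\r\n ", b"").decode("utf-8")
```

In both failure cases the original character is lost, while the last line recovers it exactly.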
In a minimally decent implementation, splitting and reassembling a UTF-8
sequence should always yield the correct result; there should be no
ambiguity.
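
As a rough sketch of why there is no ambiguity, the following Python folds at arbitrary octet positions and then unfolds. The functions fold() and unfold() are hypothetical names, and the sketch assumes the payload itself contains no CR or LF octets, so every CRLF-plus-space found in the folded data was put there by the folder.

```python
def fold(octets: bytes, width: int) -> bytes:
    """Insert CRLF + space after every `width` octets, ignoring character boundaries."""
    chunks = [octets[i:i + width] for i in range(0, len(octets), width)]
    return b"\r\n ".join(chunks)


def unfold(folded: bytes) -> bytes:
    """Delete every CRLF + space sequence (assumes the payload has no CR or LF)."""
    return folded.replace(b"\r\n ", b"")


text = "naïve café 日本語 😀"
data = text.encode("utf-8")

# Every width from 1 to 7 cuts at least one multi-octet sequence in this sample,
# yet the round trip always returns the original octets.
round_trips = [unfold(fold(data, width)) == data for width in range(1, 8)]
```

The unfolder never needs to look at the UTF-8 structure at all; it only has to remove exactly what the folder inserted.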
A good implementation, of course, would know the character encoding of
the data, and would not split multi-byte sequences in that encoding to
begin with.
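
One way such an implementation might avoid splitting a sequence is to back up from the intended split point until it is no longer sitting on a continuation octet (bit pattern 10xxxxxx). The sketch below assumes valid UTF-8 input and a width of at least 4 octets, the longest possible UTF-8 sequence; split_on_char_boundaries() is a hypothetical name. Note that it respects code point boundaries only, not combining sequences or grapheme clusters.

```python
def split_on_char_boundaries(octets: bytes, width: int) -> list:
    """Split valid UTF-8 into chunks of at most `width` octets without cutting a sequence."""
    if width < 4:
        raise ValueError("width must be at least 4, the longest UTF-8 sequence")
    chunks = []
    start = 0
    while start < len(octets):
        end = min(start + width, len(octets))
        # If the next chunk would begin with a continuation octet (10xxxxxx),
        # move the split point back to the lead octet of that sequence.
        while end < len(octets) and (octets[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(octets[start:end])
        start = end
    return chunks


sample = "naïve café 日本語 😀".encode("utf-8")
pieces = split_on_char_boundaries(sample, 4)
```

Each element of pieces decodes as UTF-8 on its own, so even a process that inspects the fragments individually would see only complete sequences.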
--
Doug Ewell | Thornton, CO, US | ewellic.org