Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Steffen Nurpmeso via Unicode unicode at unicode.org
Mon Jul 24 10:01:42 CDT 2017


"Costello, Roger L. via Unicode" <unicode at unicode.org> wrote:
 |Suppose an application splits a UTF-8 multi-octet sequence. The application \
 |then sends the split sequence to a client. The client must restore \
 |the original sequence. 
 |
 |Question: is it possible to split a UTF-8 multi-octet sequence in such \
 |a way that the client cannot unambiguously restore the original sequence?
 |
 |Here is the source of my question:
 |
 |The iCalendar specification [RFC 5545] says that long lines must be folded:
 |
 | Long content lines SHOULD be split
 |  into a multiple line representations
 |  using a line "folding" technique.
 |  That is, a long line can be split between
 |  any two characters by inserting a CRLF
 |  immediately followed by a single linear
 |  white-space character (i.e., SPACE or HTAB).
 |
 |The RFC says that, when parsing a content line, folded lines must first \
 |be unfolded using this technique:
 |
 | Unfolding is accomplished by removing
 |  the CRLF and the linear white-space
 |  character that immediately follows.
 |
 |The RFC acknowledges that simple implementations might generate improperly \
 |folded lines:
 |
 | Note: It is possible for very simple
 | implementations to generate improperly
 |  folded lines in the middle of a UTF-8
 |  multi-octet sequence.  For this reason,
 |  implementations need to unfold lines
 |  in such a way to properly restore the
 |  original sequence.

That is not what the RFC says.  It says that very simple
implementations just split lines when the limit is reached,
which might be in the middle of a UTF-8 sequence.  In this
respect the RFC is an improvement over other RFCs in the email
standards area, which give no hint on how to handle that.  Even
RFC 2231, which avoids many of the ambiguities and problems of
RFC 2047 (for a different purpose, but still), is not that
explicit about reversing the character set conversion (which I,
for one, perform _once_, after joining the chunks together; but
that is not written down anywhere and, thus, ...).
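
To illustrate the "join first, convert once" order, here is a
minimal sketch in Python.  The function names and the fixed limit
of 75 octets are made up for illustration and are not taken from
the RFC text quoted above.  It relies on the fact that the octets
CR, LF, SPACE and HTAB are all below 0x80 and so can never be part
of a UTF-8 multi-octet sequence; unfolding on the octet level
therefore restores the original octets even if the fold fell into
the middle of a sequence.

```python
import re

CRLF = b"\r\n"

# A fold is a CRLF immediately followed by one SPACE or HTAB.
_FOLD = re.compile(rb"\r\n[ \t]")


def naive_fold(raw: bytes, limit: int = 75) -> bytes:
    """Fold at fixed octet counts, ignoring UTF-8 boundaries.

    This mimics the "very simple implementation": it may cut a
    multi-octet sequence in half.  (Hypothetical helper.)
    """
    chunks = [raw[:limit]]
    pos = limit
    while pos < len(raw):
        # Continuation lines get a leading SPACE, so they carry
        # one octet less of payload.
        chunks.append(raw[pos:pos + limit - 1])
        pos += limit - 1
    return (CRLF + b" ").join(chunks)


def unfold(raw: bytes) -> bytes:
    """Remove every CRLF plus the single whitespace octet after it."""
    return _FOLD.sub(b"", raw)


def unfold_and_decode(raw: bytes) -> str:
    """Join the chunks on the octet level, then decode exactly once."""
    return unfold(raw).decode("utf-8")
```

Decoding each physical line on its own and joining the resulting
strings afterwards is the order that breaks: a line that ends in
half a sequence is not valid UTF-8 by itself.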

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

