Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Costello, Roger L. via Unicode
unicode at unicode.org
Mon Jul 24 09:39:40 CDT 2017
Hello Unicode Experts!
Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence.
Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot unambiguously restore the original sequence?
Here is the source of my question:
The iCalendar specification [RFC 5545] says that long lines must be folded:
Long content lines SHOULD be split
into a multiple line representations
using a line "folding" technique.
That is, a long line can be split between
any two characters by inserting a CRLF
immediately followed by a single linear
white-space character (i.e., SPACE or HTAB).
The RFC says that, when parsing a content line, folded lines must first be unfolded using this technique:
Unfolding is accomplished by removing
the CRLF and the linear white-space
character that immediately follows.
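The two rules quoted above can be sketched in Python, working on raw octets rather than decoded characters. This is only an illustration: the names `fold` and `unfold` and the `limit` parameter are hypothetical, not from the RFC, and `fold` is deliberately the "very simple" kind of folder that pays no attention to character boundaries.

```python
import re

CRLF = b"\r\n"

def fold(line: bytes, limit: int = 75) -> bytes:
    """Naively fold one content line (given without its trailing CRLF).

    Splits purely by octet count, so a cut may land inside a UTF-8
    multi-octet sequence. `limit` (assumed >= 2) is an illustrative parameter.
    """
    if len(line) <= limit:
        return line
    chunks = [line[:limit]]
    rest = line[limit:]
    # Continuation lines begin with one SPACE, leaving limit - 1 payload octets.
    step = limit - 1
    for i in range(0, len(rest), step):
        chunks.append(rest[i:i + step])
    return (CRLF + b" ").join(chunks)

def unfold(data: bytes) -> bytes:
    """Remove each CRLF together with the single SPACE or HTAB that follows it."""
    return re.sub(rb"\r\n[ \t]", b"", data)
```

Because `unfold` works on octets, it never needs to know where character boundaries are; decoding to text would happen only afterwards.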
The RFC acknowledges that simple implementations might generate improperly folded lines:
Note: It is possible for very simple
implementations to generate improperly
folded lines in the middle of a UTF-8
multi-octet sequence. For this reason,
implementations need to unfold lines
in such a way to properly restore the
original sequence.
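To make the situation in that note concrete, here is a small Python sketch of one improperly folded line. The sample value `SUMMARY:café` and the helper name `unfold` are assumptions chosen for illustration. It shows a single case only and is not an argument about every possible split.

```python
import re

def unfold(data: bytes) -> bytes:
    """Remove each CRLF together with the single SPACE or HTAB that follows it."""
    return re.sub(rb"\r\n[ \t]", b"", data)

# "é" (U+00E9) is encoded in UTF-8 as the two octets C3 A9.
original = "SUMMARY:café".encode("utf-8")

# Fold improperly: insert CRLF + SPACE between C3 and A9.
cut = original.index(b"\xc3") + 1
folded = original[:cut] + b"\r\n " + original[cut:]

# Unfolding at the octet level, before any decoding, rejoins C3 and A9.
restored = unfold(folded)

# Decoding the first physical line on its own fails, because it ends
# with a lead octet (C3) that has no continuation octet after it.
try:
    folded.split(b"\r\n")[0].decode("utf-8")
    first_line_decodes = True
except UnicodeDecodeError:
    first_line_decodes = False
```

In this case the octet-level unfold gives back exactly the original octets, while an implementation that decodes each physical line before unfolding would hit a decoding error on the first line.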
Can you provide an example of folding a UTF-8 multi-octet sequence such that there is no unambiguous way to restore the original sequence?