Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Costello, Roger L. via Unicode unicode at unicode.org
Mon Jul 24 09:39:40 CDT 2017


Hello Unicode Experts!

Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence. 

Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot unambiguously restore the original sequence?

Here is the source of my question:

The iCalendar specification [RFC 5545] says that long lines should be folded:

	Long content lines SHOULD be split
	into a multiple line representations
	using a line "folding" technique.
	That is, a long line can be split between
	any two characters by inserting a CRLF
	immediately followed by a single linear
	white-space character (i.e., SPACE or HTAB).

The RFC says that, when parsing a content line, folded lines must first be unfolded using this technique:

	Unfolding is accomplished by removing
	the CRLF and the linear white-space
	character that immediately follows.

The RFC acknowledges that simple implementations might generate improperly folded lines:

	Note: It is possible for very simple
	implementations to generate improperly
	folded lines in the middle of a UTF-8
	multi-octet sequence.  For this reason,
	implementations need to unfold lines
	in such a way to properly restore the
	original sequence.
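
Here is a small Python sketch of the situation the note describes, as I understand it. The sample text and the helper names unfold_octets and decode_then_unfold are made up for illustration. The fold is placed by hand between the two octets of U+00E9 (0xC3 0xA9). One client unfolds the raw octets before decoding; the other decodes first and unfolds afterwards.

```python
import re

text = "SUMMARY:caf\u00e9"           # U+00E9 encodes as the two octets 0xC3 0xA9
octets = text.encode("utf-8")

# Improper fold: CRLF + SPACE inserted between 0xC3 and 0xA9.
split_at = len(octets) - 1
folded = octets[:split_at] + b"\r\n " + octets[split_at:]

def unfold_octets(data: bytes) -> bytes:
    """Unfold on raw octets: drop each CRLF that is followed by SPACE or HTAB."""
    return re.sub(rb"\r\n[ \t]", b"", data)

def decode_then_unfold(data: bytes):
    """Decode as UTF-8 first, then unfold; returns None if decoding fails."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return re.sub(r"\r\n[ \t]", "", decoded)

# Unfolding before decoding rejoins 0xC3 and 0xA9.
restored = unfold_octets(folded).decode("utf-8")

# Decoding before unfolding hits 0xC3 followed by CR, which is not valid UTF-8.
broken = decode_then_unfold(folded)
```

In this sketch restored equals the original text, while broken is None because the strict decoder rejects the split sequence. CR, LF, SPACE and HTAB are all single octets below 0x80, whereas every octet of a UTF-8 multi-octet sequence is 0x80 or above, so the inserted fold octets can be told apart from the octets of the split character.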

Can you provide an example of folding a UTF-8 multi-octet sequence such that there is no unambiguous way to restore the original sequence? 

/Roger


