Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Costello, Roger L. via Unicode unicode at unicode.org
Mon Jul 24 12:57:43 CDT 2017


Hi Folks,

Thank you very much for your fantastic comments!

Below I have summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files.

Some questions:
 
- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements? 

/Roger

Issue: Folding and unfolding content lines in iCalendar files

The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:

	Lines of text SHOULD NOT be longer
	than 75 octets, excluding the line break.
 
The RFC says that long lines should be folded:

	Long content lines SHOULD be split
	into a multiple line representations
	using a line "folding" technique.
	That is, a long line can be split between
	any two characters by inserting a CRLF
	immediately followed by a single linear
	white-space character (i.e., SPACE or HTAB).
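
To make the folding rule concrete, here is a minimal sketch of a folder that never splits inside a UTF-8 multi-octet sequence. The function name fold and its limit parameter are my own and do not come from the RFC. It assumes the input is one unfolded content line, already encoded as UTF-8, without its trailing CRLF.

```python
def fold(line: bytes, limit: int = 75) -> bytes:
    """Fold one UTF-8 encoded content line so no physical line exceeds `limit` octets."""
    chunks = []
    start = 0
    width = limit  # the first physical line may use the full limit
    while len(line) - start > width:
        end = start + width
        # Back up while the octet at the split point is a UTF-8
        # continuation byte (10xxxxxx), so we split between characters.
        while end > start and (line[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Malformed input (only continuation bytes): split at the limit anyway.
            end = start + width
        chunks.append(line[start:end])
        start = end
        width = limit - 1  # later lines carry one leading SPACE
    chunks.append(line[start:])
    return b"\r\n ".join(chunks)
```

For example, fold(("SUMMARY:" + "\u00a5" * 50).encode("utf-8")) splits after octet 74 rather than octet 75, because octet 75 would fall between the two bytes of a Yen sign.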

The RFC says that, when parsing a content line, folded lines must first be unfolded:

	When parsing a content line, folded lines MUST
	first be unfolded.

using this technique:

	Unfolding is accomplished by removing the
	CRLF and the linear white-space character
	that immediately follows.
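
The unfolding rule can be sketched in a few lines. This is only an illustration, and the name unfold is my own. It works on the raw octets, before any UTF-8 decoding, and removes exactly one white-space character after each CRLF.

```python
import re

# A fold is a CRLF immediately followed by a single SPACE or HTAB.
_FOLD = re.compile(rb"\r\n[ \t]")

def unfold(data: bytes) -> bytes:
    """Remove every CRLF plus the one white-space octet that follows it."""
    return _FOLD.sub(b"", data)
```

Because only one white-space character is removed per fold, a second space after the CRLF belongs to the content and survives unfolding.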

The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:

	Note: It is possible for very simple
	implementations to generate improperly
	folded lines in the middle of a UTF-8
	multi-octet sequence.  For this reason,
	implementations need to unfold lines
	in such a way to properly restore the
	original sequence.

Here is an example of folding in the middle of a UTF-8 multi-octet sequence: 

The iCalendar file contains the Yen sign (U+00A5), which is represented in UTF-8 by the byte sequence 0xC2 0xA5. The content line containing the Yen sign is folded between those two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which is no longer valid UTF-8.
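
The Yen sign example can be checked directly. This small sketch (the variable names are mine) shows that the folded octets are rejected by a strict UTF-8 decoder, and that removing the CRLF and the space at the octet level restores the original character.

```python
# The five octets from the example: 0xC2, CR, LF, SPACE, 0xA5.
folded = b"\xc2\r\n \xa5"

try:
    folded.decode("utf-8")
    valid_while_folded = True
except UnicodeDecodeError:
    # 0xC2 must be followed by a continuation byte, not by CR.
    valid_while_folded = False

# Unfold on the raw octets first, then decode.
restored = folded.replace(b"\r\n ", b"")
yen = restored.decode("utf-8")
```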

Proposed requirements on the behavior of applications that receive iCalendar files. A receiving application must not exhibit any of the following:

1. (Bug) The receiving application does not recognize that it has received an iCalendar file.

2. (Bug) The sending application folds a content line (inserts a CRLF followed by a white-space character) and the receiving application unfolds it, but fails to delete all of the inserted octets.

3. (Non-conformant behavior) The receiving application, after the line has been folded and before unfolding it, attempts to interpret the partial UTF-8 sequences and converts them into replacement characters or worse.
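
To illustrate item 3, here is a rough sketch contrasting the two orders of operation. The function names wrong_order and right_order are hypothetical. I am assuming a decoder that substitutes U+FFFD for invalid octets, which is what Python does with errors="replace".

```python
import re

def wrong_order(data: bytes) -> str:
    """Decode first, then unfold: partial sequences become U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return re.sub(r"\r\n[ \t]", "", text)

def right_order(data: bytes) -> str:
    """Unfold the raw octets first, then decode the restored line."""
    return re.sub(rb"\r\n[ \t]", b"", data).decode("utf-8")
```

With the input b"SUMMARY:\xc2\r\n \xa5100", wrong_order loses the Yen sign for good (each half becomes a replacement character), while right_order recovers it.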


