Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?
Costello, Roger L. via Unicode
unicode at unicode.org
Mon Jul 24 12:57:43 CDT 2017
Thank you very much for your fantastic comments!
Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files.
- Have I captured all your comments? Any more comments?
- Are the proposed requirements sensible? Any more requirements?
Issue: Folding and unfolding content lines in iCalendar files
The iCalendar specification [RFC 5545] says that a content line should not be longer than 75 octets:
Lines of text SHOULD NOT be longer
than 75 octets, excluding the line break.
The RFC says that long lines should be folded:
Long content lines SHOULD be split
into a multiple line representations
using a line "folding" technique.
That is, a long line can be split between
any two characters by inserting a CRLF
immediately followed by a single linear
white-space character (i.e., SPACE or HTAB).
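As an illustration of folding that never lands inside a multi-octet sequence, here is a minimal Python sketch. The function name fold_line and the choice to count the leading SPACE toward the 75-octet limit are my own assumptions, not text from the RFC.

```python
def fold_line(line: str, limit: int = 75) -> bytes:
    """Fold one content line so no physical line exceeds `limit` octets.

    Hypothetical helper: it folds only between characters, so a UTF-8
    multi-octet sequence is never split across a fold.
    """
    out = bytearray()
    current = 0  # octets on the current physical line, excluding CRLF
    for ch in line:
        encoded = ch.encode("utf-8")
        if current + len(encoded) > limit:
            out += b"\r\n "  # CRLF followed by a single SPACE
            current = 1      # assumption: the leading SPACE counts toward the limit
        out += encoded
        current += len(encoded)
    return bytes(out)
```

Because the loop walks characters rather than octets, every physical line it produces is valid UTF-8 on its own.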
The RFC says that, when parsing a content line, folded lines must first be unfolded:
When parsing a content line, folded lines MUST
first be unfolded.
using this technique:
Unfolding is accomplished by removing the
CRLF and the linear white-space character
that immediately follows.
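The unfolding rule quoted above could be sketched in Python roughly as follows. The name unfold is hypothetical, and working on raw octets (rather than on decoded text) is my assumption about how to satisfy the rule safely.

```python
import re

# CRLF followed by exactly one SPACE (0x20) or HTAB (0x09), matched on octets.
_FOLD = re.compile(rb"\r\n[ \t]")

def unfold(raw: bytes) -> bytes:
    """Remove each CRLF plus the single white-space octet that follows it."""
    return _FOLD.sub(b"", raw)
```

Only one white-space octet is removed per fold, so any further white space after it stays part of the content.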
The RFC acknowledges that some implementations might do folding in the middle of a multi-octet sequence:
Note: It is possible for very simple
implementations to generate improperly
folded lines in the middle of a UTF-8
multi-octet sequence. For this reason,
implementations need to unfold lines
in such a way to properly restore the
original sequence.
Here is an example of folding in the middle of a UTF-8 multi-octet sequence:
The iCalendar file contains the Yen sign (U+00A5), which is represented in UTF-8 by the byte sequence 0xC2 0xA5. The content line containing the Yen sign is folded between those two bytes. The result is 0xC2 0x0D 0x0A 0x20 0xA5, which is no longer valid UTF-8.
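To make the Yen sign example concrete, here is a small Python sketch comparing the two possible orders of operations. The SUMMARY content is made up for illustration, and the simple replace() call only handles folds that use SPACE.

```python
# A content line folded between the two octets of the Yen sign (0xC2 0xA5).
folded = b"SUMMARY:Cost \xc2\r\n \xa5100\r\n"

# Wrong order: decode first, then unfold. Each half of the split sequence
# is invalid on its own and becomes U+FFFD REPLACEMENT CHARACTER.
wrong = folded.decode("utf-8", errors="replace").replace("\r\n ", "")

# Right order: unfold on the raw octets, then decode.
right = folded.replace(b"\r\n ", b"").decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Once the octets have been replaced by U+FFFD, the original character cannot be recovered, which is why unfolding has to happen before any attempt to interpret the bytes as UTF-8.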
Proposed requirements on the behavior of applications that receive iCalendar files:
1. (Bug) The receiving application does not recognize that it has received an iCalendar file.
2. (Bug) The sending application folds the content lines (inserts CRLF followed by a white-space character), and the receiving application unfolds them but does not delete all of the inserted octets.
3. (Non-conformant behavior) The receiving application, after folding and before unfolding, attempts to interpret the partial UTF-8 sequences and convert them into replacement characters or worse.