Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

Philippe Verdy via Unicode unicode at unicode.org
Mon Jul 24 10:27:09 CDT 2017


But at the same time, that RFC makes a direct reference to UTF-8 as being
the default charset, so an implementation of the RFC cannot be agnostic to
what UTF-8 is, and will not break in the middle of a conforming UTF-8
sequence.

When the limit is reached, such an implementation knows that it cannot cut
at the position of a UTF-8 trailing byte, and knows that it can safely
roll back at most 3 bytes to locate either a conforming UTF-8 leading byte
(to split the line **before** it) or any 7-bit ASCII byte (to split the
line just **after** it). This requires very little buffering, and it is a
fundamental property of UTF-8.
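
The rollback described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not code from the RFC or
from any particular iCalendar library: the names is_continuation, safe_cut,
fold and unfold are hypothetical, the input is assumed to be one unfolded
content line of conforming UTF-8 without its terminating CRLF, and the
default limit of 75 octets is the line length recommended by RFC 5545.

```python
def is_continuation(byte: int) -> bool:
    """True for a UTF-8 trailing byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def safe_cut(data: bytes, limit: int) -> int:
    """Return a cut position <= limit that does not split a UTF-8 sequence.

    Rolls back at most 3 bytes, which is enough for conforming UTF-8
    because a sequence has at most 3 trailing bytes.
    """
    if limit >= len(data):
        return len(data)
    cut = limit
    for _ in range(3):
        # A leading byte or an ASCII byte at data[cut] means that cutting
        # just before it leaves the previous sequence complete.
        if cut == 0 or not is_continuation(data[cut]):
            return cut
        cut -= 1
    if is_continuation(data[cut]):
        # More than 3 trailing bytes in a row: the input is not conforming
        # UTF-8, so no safe boundary exists nearby; cut at the raw limit.
        return limit
    return cut


def fold(line: bytes, limit: int = 75) -> bytes:
    """Fold one content line so that no physical line exceeds `limit` octets."""
    if limit < 5:
        raise ValueError("limit too small to guarantee progress")
    parts = []
    start = 0
    room = limit
    while len(line) - start > room:
        cut = safe_cut(line, start + room)
        parts.append(line[start:cut])
        start = cut
        room = limit - 1  # continuation lines start with one SPACE
    parts.append(line[start:])
    return b"\r\n ".join(parts)


def unfold(folded: bytes) -> bytes:
    """Remove each CRLF that is immediately followed by SPACE or HTAB."""
    return folded.replace(b"\r\n ", b"").replace(b"\r\n\t", b"")
```

With this approach every physical line is itself conforming UTF-8, and
unfold(fold(line)) gives back the original octets.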

Other character sets -- including /UTF-(16|32)([LB]E)?/ !!! -- are not
directly supported, except by external decoders which would convert their
input stream to UTF-8 (with all the same issues that may occur for such a
conversion when it is not roundtrip compatible or the input does not
conform to the specification of the input charset, but this is not the
problem of this RFC: these decoders may also roll back internally, attempt
to guess another charset, or use substitution, but they are supposed to
generate conforming UTF-8 on output).
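
As an illustration of such an external decoder, here is a minimal Python
sketch that uses substitution. The function name to_utf8 is hypothetical,
and choosing substitution over guessing another charset is an assumption
made for the example, not something the RFC prescribes.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Transcode `raw` from `source_charset` to conforming UTF-8.

    Ill-formed input is replaced by U+FFFD instead of raising an error,
    so the output is always conforming UTF-8 (although the conversion is
    then no longer roundtrip compatible).
    """
    text = raw.decode(source_charset, errors="replace")
    return text.encode("utf-8")
```

For example, to_utf8(payload, "utf-16-le") would be applied to the whole
payload before any folding takes place, so that the folding code only ever
sees UTF-8.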


2017-07-24 17:01 GMT+02:00 Steffen Nurpmeso via Unicode <unicode at unicode.org>:

> "Costello, Roger L. via Unicode" <unicode at unicode.org> wrote:
>  |Suppose an application splits a UTF-8 multi-octet sequence. The
>  |application then sends the split sequence to a client. The client
>  |must restore the original sequence.
>  |
>  |Question: is it possible to split a UTF-8 multi-octet sequence in such
>  |a way that the client cannot unambiguously restore the original sequence?
>  |
>  |Here is the source of my question:
>  |
>  |The iCalendar specification [RFC 5545] says that long lines must be folded:
>  |
>  | Long content lines SHOULD be split
>  |  into a multiple line representations
>  |  using a line "folding" technique.
>  |  That is, a long line can be split between
>  |  any two characters by inserting a CRLF
>  |  immediately followed by a single linear
>  |  white-space character (i.e., SPACE or HTAB).
>  |
>  |The RFC says that, when parsing a content line, folded lines must first
>  |be unfolded using this technique:
>  |
>  | Unfolding is accomplished by removing
>  |  the CRLF and the linear white-space
>  |  character that immediately follows.
>  |
>  |The RFC acknowledges that simple implementations might generate
>  |improperly folded lines:
>  |
>  | Note: It is possible for very simple
>  | implementations to generate improperly
>  |  folded lines in the middle of a UTF-8
>  |  multi-octet sequence.  For this reason,
>  |  implementations need to unfold lines
>  |  in such a way to properly restore the
>  |  original sequence.
>
> That is not what the RFC says.  It says that simple
> implementations simply split lines when the limit is reached,
> which might be in the middle of an UTF-8 sequence.  The RFC is
> thus improved compared to other RFCs in the email standard
> section, which do not give any hints on how to do that.  Even
> RFC 2231, which avoids many of the ambiguities and problems of RFC
> 2047 (for a different purpose, but still), does not say it so
> exactly for the reversing character set conversion (which i for
> one perform _once_ after joining together the chunks, but is not
> a written word and, thus, ...).
>
> --steffen
> |
> |Der Kragenbaer,                The moon bear,
> |der holt sich munter           he cheerfully and one by one
> |einen nach dem anderen runter  wa.ks himself off
> |(By Robert Gernhardt)
>