Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Sat Oct 13 18:37:35 CDT 2018

On Sat, 13 Oct 2018 at 18:58, Steffen Nurpmeso via Unicode <
unicode at unicode.org> wrote:

> Philippe Verdy via Unicode wrote in <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+\
> w9+jEARW4Ghyk8hg at mail.gmail.com>:
>  |You forget that Base64 (as used in MIME) does not follow these rules \
>  |as it allows multiple different encodings for the same source binary. \
>  |MIME actually
>  |splits a binary object into multiple fragments at random positions, \
>  |and then encodes these fragments separately. Also MIME uses an extension
> \
>  |of Base64
>  |where it allows some variations in the encoding alphabet (so even the \
>  |same fragment of the same length may have two disting encodings).
>  |
>  |Base64 in MIME is different from standard Base64 (which never splits \
>  |the binary object before encoding it, and uses a strict alphabet of \
>  |64 ASCII
>  |characters, allowing no variation). So MIME requires special handling: \
>  |the assumpton that a binary message is encoded the same is wrong, but \
>  |MIME still
>  |requires that this non unique Base64 encoding will be decoded back \
>  |to the same initial (unsplitted) binary object (independantly of its \
>  |size and
>  |independantly of the splitting boundaries used in the transport, which \
>  |may change during the transport).
>
> Base64 is defined in RFC 2045 (Multipurpose Internet Mail
> Extensions (MIME) Part One: Format of Internet Message Bodies).
> It is a content-transfer-encoding and encodes any data
> transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
> (the authors commemorate that) text.
> When decoding it reverts this representation into its original form.
> Ok, there is the CRLF newline problem, as below.
> What do you mean by "splitting"?
>
> ...
> The only variance is described as:
>
>   Care must be taken to use the proper octets for line breaks if base64
>   encoding is applied directly to text material that has not been
>   converted to canonical form.  In particular, text line breaks must be
>   converted into CRLF sequences prior to base64 encoding.  The
>   important thing to note is that this may be done directly by the
>   encoder rather than in a prior canonicalization step in some
>   implementations.
>
> This is MIME, it specifies (in the same RFC):

I have not spoken about the encoding of newlines **in the actual encoded
text**:
-  if the text, in its existing text encoding, ever gets converted to Base64
as if the whole text were an opaque binary object, its initial text encoding
will be preserved (so yes, this preserves the way the embedded newlines are
encoded: CR, LF, CR+LF, NL...).
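
A minimal sketch of that point, using Python's standard `base64` module (the
sample text and variable names are only illustrative): the same two lines of
text with LF and with CR+LF newlines are different octet sequences, so they
get different Base64 encodings, and each decodes back to exactly the octets
it started from.

```python
import base64

# The same two lines of text, with two different newline conventions.
text_lf = "line one\nline two\n".encode("utf-8")
text_crlf = "line one\r\nline two\r\n".encode("utf-8")

enc_lf = base64.b64encode(text_lf)
enc_crlf = base64.b64encode(text_crlf)

# Different octets in, different Base64 out.
print(enc_lf != enc_crlf)

# Decoding restores each input unchanged, newline convention included.
print(base64.b64decode(enc_lf) == text_lf)
print(base64.b64decode(enc_crlf) == text_crlf)
```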

I spoke about newlines used in the transport syntax to split the initial
binary object (which may actually contain text, but that does not matter).
MIME defines this operation and even requires splitting the binary object
into fragments with a maximum binary size, so that these binary fragments can
be converted with Base64 into lines with a maximum length. In the MIME Base64
representation you can insert newlines anywhere between fragments encoded
separately.

The maximum fragment size is not fixed. RFC 2045 limits encoded lines to 76
ASCII characters, so a line carries at most 57 binary octets, followed by a
newline (CR+LF is strongly suggested for MIME, but other newline sequences
are tolerated). Email forwarding agents frequently needed these line lengths
to process the mail properly: not just the MIME headers but also the content
body, where they want at least some whitespace or newline in the middle so
that they can freely rearrange the lines by compressing whitespace or
splitting lines to a shorter length as their processing requires. This is
much less frequent today because most mail agents are 8-bit clean and allow
arbitrary line lengths... except in MIME headers.
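
As a rough illustration of the body case, here is a sketch with Python's
standard `base64` module. The `wrap` helper and the line lengths 76 and 40
are only assumptions made for the example: the same binary object is
serialized with two different line lengths, the two transport texts differ,
and both decode to the identical octets.

```python
import base64

payload = bytes(range(256)) * 2  # arbitrary binary object, 512 octets

# One continuous Base64 string, no line breaks.
flat = base64.b64encode(payload).decode("ascii")

def wrap(encoded: str, width: int) -> str:
    """Split an encoded string into CRLF-terminated lines of `width` chars."""
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return "\r\n".join(lines) + "\r\n"

wrapped_76 = wrap(flat, 76)
wrapped_40 = wrap(flat, 40)

# Two different transport representations...
print(wrapped_76 != wrapped_40)

# ...of the same binary object: by default the decoder discards the
# CR and LF octets, which are outside the Base64 alphabet.
print(base64.b64decode(wrapped_76) == payload)
print(base64.b64decode(wrapped_40) == payload)
```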

In MIME headers the situation is different: there really is a maximum
line length there, and if a header is too long, it has to be split over
multiple lines using continuation sequences, i.e. a newline (CR+LF is
standard here) followed by at least one space (this
insertion/change/removal of whitespace is permitted everywhere in the MIME
header after the header type, and even before the colon that follows the
header type). So a MIME header value whose text gets encoded with Base64
will be split into fragments. Each fragment starts with a "=?" sequence,
followed by the charset and an indication that the fragment is
Base64-encoded (instead of being QuotedPrintable-encoded), then a separator
and the encapsulated Base64 encoding of the fragment, closed by "?=". A
single header may have multiple Base64-encoded fragments in the same header
value, and there is large freedom about where to split the value to isolate
fragments of a convenient size that satisfies the MIME requirements. These
multiple fragments may then occur on the same line (separated by
whitespace) or on multiple lines (separated by continuation sequences).
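
The header case can be sketched with Python's standard `email.header`
module. The `encoded_word` and `decode` helpers, the sample subject and the
split positions are assumptions chosen for the example, not anything MIME
prescribes. The same text is written as one encoded fragment, as two
fragments on one line, and as two fragments on a folded line, and all three
decode to the same text.

```python
import base64
from email.header import decode_header, make_header

def encoded_word(text: str) -> str:
    """Wrap one text fragment as a Base64 encoded-word declared as UTF-8."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{b64}?="

def decode(value: str) -> str:
    """Decode a header value that may contain several encoded-words."""
    return str(make_header(decode_header(value)))

subject = "Grüße aus München und Zürich"

# Three different header representations of the same text.
one_word = encoded_word(subject)
same_line = encoded_word(subject[:10]) + " " + encoded_word(subject[10:])
folded = encoded_word(subject[:5]) + "\r\n " + encoded_word(subject[5:])

print(len({one_word, same_line, folded}))  # number of distinct spellings
print(decode(one_word) == decode(same_line) == decode(folded) == subject)
```

The splits are made between characters, never inside a multi-octet UTF-8
sequence, so each fragment is decodable on its own.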

In that case, the same initial text can have multiple valid representations
in a MIME envelope format using Base64: it is not Base64 itself that splits
the message, but the MIME transport syntax (which itself does not alter the
initial text encoding of the initial text... except in parts that are NOT
binary-encoded using Base64 or QuotedPrintable).

We are in a case where Base64 is not applied uniquely, because it is driven
not by the actual transported text, but by the MIME transport syntax. MIME
allows freely changing the Base64 fragment sizes (or even switching to
another encoding) as long as the binary value of the embedded object is
preserved. It also allows changing the text encoding (UTF-8, ISO 8859-*,
etc.) if the encoded fragments are identified as actually containing text.
This does not apply to content bodies, unless they are declared with a
"text/*" MIME type in the headers, but it does apply to known headers whose
value is necessarily of a text type (such as headers with types "From:",
"To:", "Cc:", "Subject:", "Date:" ...).

MIME defines two distinct syntaxes, one for declaration headers, another
for content bodies. Each one can use Base64 encoding and split the content
(but differently).
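
The freedom to switch the transfer encoding, or the charset of content known
to be text, can be sketched as follows with Python's standard `base64` and
`quopri` modules. The sample text and the choice of ISO 8859-1 as the second
charset are assumptions made for the example.

```python
import base64
import quopri

text = "Café à la crème"

# Same octets, two different content-transfer-encodings.
payload = text.encode("utf-8")
as_base64 = base64.b64encode(payload)
as_qp = quopri.encodestring(payload)

print(as_base64 != as_qp)
print(base64.b64decode(as_base64) == payload)
print(quopri.decodestring(as_qp) == payload)

# Same text, two different charsets: the octets and therefore the Base64
# text differ, but the decoded text is identical once the declared charset
# is applied.
as_latin1_b64 = base64.b64encode(text.encode("iso-8859-1"))
print(as_latin1_b64 != as_base64)
print(base64.b64decode(as_latin1_b64).decode("iso-8859-1") == text)
```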

HTTP also has a mechanism for splitting a large body into fragments. This
notably allows streaming protocols in which fragments can easily be
multiplexed with parallel streams, or in which digital fingerprints or
security signatures are included for individual fragments to secure the
stream. This fragmentation is independent of the network transport
(generally TCP, but not only), which has its own transparent MTUs at the
session and link layers and can itself be encapsulated in tunnels
transported by other means with different MTUs and fragmentation: HTTP does
not have to manage that lower layer.
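
Assuming the mechanism meant here is the HTTP/1.1 chunked transfer coding,
a minimal sketch of it looks like this. The `chunk_encode` and
`chunk_decode` helpers are hypothetical names written for the example, and
the decoder ignores trailers: the same body is sent with two different chunk
sizes, the bytes on the wire differ, and the reassembled body is identical.

```python
def chunk_encode(body: bytes, chunk_size: int) -> bytes:
    """Serialize a body as chunks: hex size, CRLF, data, CRLF, then a 0 chunk."""
    out = bytearray()
    for i in range(0, len(body), chunk_size):
        piece = body[i:i + chunk_size]
        out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk, no trailers
    return bytes(out)

def chunk_decode(wire: bytes) -> bytes:
    """Reassemble the body from a chunked message (trailers not handled)."""
    body = bytearray()
    pos = 0
    while True:
        eol = wire.index(b"\r\n", pos)
        # Ignore any chunk extension after ';' on the size line.
        size = int(wire[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)
        body += wire[pos:pos + size]
        pos += size + 2  # skip the data and its trailing CRLF

body = b"The quick brown fox jumps over the lazy dog. " * 4

wire_small = chunk_encode(body, 16)
wire_large = chunk_encode(body, 100)

print(wire_small != wire_large)
print(chunk_decode(wire_small) == chunk_decode(wire_large) == body)
```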

Both MIME (for mails) and HTTP define allowed transformations to drive how
Base64 will be used. Both have enough flexibility to allow variable
fragment sizes, and even allow them to be changed as needed for the
transport (this is challenging for data signatures of the exchanged
contents, but both MIME and HTTP can safely preserve the content without
breaking these signatures in the middle): the recipient may not receive
exactly the same Base64-encoded message, but it will get the same message
content (once it is Base64-decoded).

Base64 is used exactly to support this flexibility in transport (or
storage) without altering any bit of the initial content once it is decoded.