Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Steffen Nurpmeso via Unicode unicode at unicode.org
Sat Oct 13 11:50:19 CDT 2018


Philippe Verdy via Unicode wrote in <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+\
w9+jEARW4Ghyk8hg at mail.gmail.com>:
 |You forget that Base64 (as used in MIME) does not follow these rules,
 |as it allows multiple different encodings for the same source binary.
 |MIME actually splits a binary object into multiple fragments at random
 |positions, and then encodes these fragments separately. Also, MIME uses
 |an extension of Base64 where it allows some variations in the encoding
 |alphabet (so even the same fragment of the same length may have two
 |distinct encodings).
 |
 |Base64 in MIME is different from standard Base64 (which never splits
 |the binary object before encoding it, and uses a strict alphabet of
 |64 ASCII characters, allowing no variation). So MIME requires special
 |handling: the assumption that a binary message is always encoded the
 |same way is wrong, but MIME still requires that this non-unique Base64
 |encoding will be decoded back to the same initial (unsplit) binary
 |object (independently of its size and independently of the splitting
 |boundaries used in the transport, which may change during the
 |transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and transparently encodes any
data into 7-bit clean text that is compatible with both ASCII
_and_ EBCDIC (the authors make a point of that).
Decoding reverts this representation to its original form.
Ok, there is the CRLF newline problem, see below.
What do you mean by "splitting"?
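
To make the question concrete, here is a minimal Python sketch (not
taken from RFC 2045; the helper name `wrap` and the sample data are
made up for illustration).  It assumes that "splitting" means nothing
more than where the encoder breaks its output lines: the base64 text
then differs, but the decoder skips the line breaks and returns the
same octets.

```python
import base64

def wrap(b64: bytes, width: int) -> bytes:
    """Break base64 text into CRLF-terminated lines of at most `width`
    characters (hypothetical helper, for illustration only)."""
    return b"".join(b64[i:i + width] + b"\r\n"
                    for i in range(0, len(b64), width))

data = bytes(range(256))            # arbitrary sample octets
flat = base64.b64encode(data)       # one long line, no line breaks

wrapped_76 = wrap(flat, 76)         # 76 is the RFC 2045 line length limit
wrapped_20 = wrap(flat, 20)         # shorter lines, same alphabet

# By default b64decode() discards characters outside the base64
# alphabet, so the CRLF sequences are ignored.
decoded_76 = base64.b64decode(wrapped_76)
decoded_20 = base64.b64decode(wrapped_20)

print(wrapped_76 != wrapped_20)           # the encoded texts differ
print(decoded_76 == decoded_20 == data)   # the decoded octets do not
```

Both comparisons hold in this sketch: the line layout varies, the
decoded data does not.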

...
The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form.  In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding.  The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.
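
A small Python sketch of why this matters (`canonicalize` is a made-up
helper, not something RFC 2045 defines, and it deliberately ignores
lone CR octets): the same text with LF and with CRLF line breaks gives
different base64 output unless the line breaks are converted first.

```python
import base64

def canonicalize(text: bytes) -> bytes:
    """Convert LF and CRLF line breaks to CRLF (hypothetical helper;
    lone CR octets are left untouched)."""
    return text.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

unix_text = b"one\ntwo\n"        # local form with LF line breaks
dos_text = b"one\r\ntwo\r\n"     # same text, already using CRLF

# Encoding the local forms directly gives two different base64 texts.
raw_unix = base64.b64encode(unix_text)
raw_dos = base64.b64encode(dos_text)

# Encoding after canonicalization gives one and the same base64 text.
canon_unix = base64.b64encode(canonicalize(unix_text))
canon_dos = base64.b64encode(canonicalize(dos_text))

print(raw_unix != raw_dos)
print(canon_unix == canon_dos)
```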

This is MIME, which specifies (in the same RFC):

  2.10.  Lines

   "Lines" are defined as sequences of octets separated by a CRLF
   sequences.  This is consistent with both RFC 821 and RFC 822.
   "Lines" only refers to a unit of data in a message, which may or may
   not correspond to something that is actually displayed by a user
   agent.

and furthermore

  6.5.  Translating Encodings

   The quoted-printable and base64 encodings are designed so that
   conversion between them is possible.  The only issue that arises in
   such a conversion is the handling of hard line breaks in quoted-
   printable encoding output. When converting from quoted-printable to
   base64 a hard line break in the quoted-printable form represents a
   CRLF sequence in the canonical form of the data. It must therefore be
   converted to a corresponding encoded CRLF in the base64 form of the
   data.  Similarly, a CRLF sequence in the canonical form of the data
   obtained after base64 decoding must be converted to a quoted-
   printable hard line break, but ONLY when converting text data.

So we go over

  6.6.  Canonical Encoding Model

   There was some confusion, in the previous versions of this RFC,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system, and the relationship
   between content-transfer-encodings and character sets.  A canonical
   model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

         For example, in the case of text/plain data, the text
          must be converted to a supported character set and
          lines must be delimited with CRLF delimiters in
          accordance with RFC 822.  Note that the restriction on
          line lengths implied by RFC 822 is eliminated if the
          next step employs either quoted-printable or base64
          encoding.

and, later

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

and, later

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.
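
The last sentence of that note can be tried out with a short Python
sketch.  The two conversion functions are hypothetical stand-ins for a
system that stores messages with LF as the local newline convention;
they are not taken from any real MUA.

```python
def to_local_lf(canonical: bytes) -> bytes:
    """Store a canonical message using LF as the newline (hypothetical)."""
    return canonical.replace(b"\r\n", b"\n")

def to_canonical(local: bytes) -> bytes:
    """Turn the locally stored form back into CRLF form (hypothetical)."""
    return local.replace(b"\n", b"\r\n")

text_part = b"line one\r\nline two\r\n"
# The PNG file signature contains a CRLF as well as a bare LF octet.
binary_part = b"\x89PNG\r\n\x1a\n"

text_round_trip = to_canonical(to_local_lf(text_part))
binary_round_trip = to_canonical(to_local_lf(binary_part))

print(text_round_trip == text_part)       # text survives the round trip
print(binary_round_trip == binary_part)   # binary data does not
```

The bare LF comes back as CRLF, so the binary data is one octet longer
after the round trip.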

Who can make sense of this emojibake.
My MUA still gnaws at antiquated structures (i am too lazy), but
in quoted-printable we encode a CRLF in the raw text as "=0D=0A=",
i.e., followed by a soft line break, so that the data is decoded
as a plain CRLF again.  It should be something like that, i think.
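
A quick check with Python's standard `quopri` module (a sketch of
decoder behaviour only; the sample strings are made up, and this says
nothing about which form an encoder should choose for text):

```python
import quopri

# CRLF written as encoded octets, followed by a soft line break ("=" + CRLF).
soft = b"line one=0D=0A=\r\nline two"
# The same content written with a hard line break.
hard = b"line one\r\nline two"

decoded_soft = quopri.decodestring(soft)
decoded_hard = quopri.decodestring(hard)

print(decoded_soft == decoded_hard)   # both decode to the same octets
```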

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

