Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Adam Borowski via Unicode unicode at unicode.org
Sat Oct 13 20:39:04 CDT 2018


On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> Le sam. 13 oct. 2018 à 18:58, Steffen Nurpmeso via Unicode <
> unicode at unicode.org> a écrit :
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken aboutr the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"     is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


ᛗᛖᛟᚹ
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.


More information about the Unicode mailing list