Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?
Adam Borowski via Unicode
unicode at unicode.org
Sat Oct 13 20:39:04 CDT 2018
On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> On Sat, Oct 13, 2018 at 18:58, Steffen Nurpmeso via Unicode <
> unicode at unicode.org> wrote:
> > The only variance is described as:
> >
> > Care must be taken to use the proper octets for line breaks if base64
> > encoding is applied directly to text material that has not been
> > converted to canonical form. In particular, text line breaks must be
> > converted into CRLF sequences prior to base64 encoding. The
> > important thing to note is that this may be done directly by the
> > encoder rather than in a prior canonicalization step in some
> > implementations.
> >
> > This is MIME, it specifies (in the same RFC):
>
> I've not spoken about the encoding of newlines **in the actual encoded
> text**:
> - if the text, in its existing text encoding, gets converted to Base64 as
> if it were an opaque binary object, that text encoding will be
> preserved (so yes, it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
>
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.
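To illustrate the line-splitting point quoted above, here is a minimal Python sketch. It assumes the standard library's base64 module; the 512-byte payload is made up for the illustration. It encodes the same bytes once as a single line and once wrapped into lines the way MIME does, and checks that both decode to the original bytes.

```python
import base64

# Hypothetical payload, chosen only to be long enough to need several lines.
payload = bytes(range(256)) * 2

one_line = base64.b64encode(payload)    # no line breaks at all
wrapped = base64.encodebytes(payload)   # newline after every 76 characters

# The two encoded forms differ as byte strings...
print(one_line == wrapped)

# ...but the line breaks belong to the transport syntax only, so both
# decode to the very same payload.
print(base64.b64decode(one_line) == payload)
print(base64.decodebytes(wrapped) == payload)
```

The newlines here are inserted by the encoder between separately encoded fragments; any newlines inside the payload itself would be encoded like any other bytes.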
There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):
The data stream gets split into packets: 3 bytes of input become 4
characters of output. A packet may contain fewer than 3 bytes, in which
case it is padded with = characters:
  3 bytes   XXXX
  2 bytes   XXX=
  1 byte    XX==
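The padding rule in the table can be checked with a small Python sketch, assuming the standard base64 module; the helper name encode_packet and the sample inputs are hypothetical, picked only for illustration.

```python
import base64

def encode_packet(packet: bytes) -> str:
    """Encode a single packet of 1 to 3 bytes into 4 output characters."""
    if not 1 <= len(packet) <= 3:
        raise ValueError("a packet holds 1 to 3 bytes")
    return base64.b64encode(packet).decode("ascii")

# One sample per packet size: full, one byte short, two bytes short.
packets = {n: encode_packet(b"abc"[:n]) for n in (3, 2, 1)}
for size, text in packets.items():
    print(size, text)
# 3 YWJj
# 2 YWI=
# 1 YQ==
```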
Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.
For example:
  "meow" encoded in one piece is    bWVvdw==
  "me" + "ow" encoded piecewise is  bWU=b3c=
yet both carry the same payload.
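A minimal Python sketch of the same example, assuming the standard base64 module; encode_piecewise and decode_by_packet are hypothetical helper names. Because a decoder may treat = as the end of the data, the sketch decodes each 4-character packet on its own rather than handing the concatenated string to a library decoder.

```python
import base64

def encode_piecewise(chunks):
    """Encode each chunk on its own and concatenate the padded results."""
    return "".join(base64.b64encode(chunk).decode("ascii") for chunk in chunks)

def decode_by_packet(text):
    """Decode every 4-character packet independently, then join the bytes."""
    if len(text) % 4 != 0:
        raise ValueError("expected a whole number of 4-character packets")
    return b"".join(
        base64.b64decode(text[i:i + 4]) for i in range(0, len(text), 4)
    )

whole = encode_piecewise([b"meow"])
split = encode_piecewise([b"me", b"ow"])
print(whole)  # bWVvdw==
print(split)  # bWU=b3c=

# Different encoded text, identical payload.
print(decode_by_packet(whole) == decode_by_packet(split) == b"meow")  # True
```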
> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.
Right, any such variations are in packaging only.
ᛗᛖᛟᚹ
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.