Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Philippe Verdy via Unicode unicode at unicode.org
Mon Oct 15 06:13:41 CDT 2018


See https://tools.ietf.org/html/rfc4648, section 3.2, first paragraph, first
sentence, where it is explicitly stated:

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used.
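
A minimal sketch in Python of what accepting such unpadded data can look like. The helper name decode_unpadded is hypothetical; the sketch assumes the only deviation from the padded form is the missing trailing "=", and it restores the padding because Python's base64.b64decode expects it.

```python
import base64


def decode_unpadded(text: str) -> bytes:
    """Decode Base64 text whose trailing "=" padding may have been omitted."""
    # base64.b64decode expects a length that is a multiple of 4,
    # so put back the "=" characters the sender left out.
    missing = -len(text) % 4
    return base64.b64decode(text + "=" * missing)


# The padded and unpadded spellings carry the same payload.
print(decode_unpadded("bWVvdw") == decode_unpadded("bWVvdw=="))
```

Whether a given protocol permits dropping the padding is decided by that protocol's own specification, not by this helper.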


On Mon, Oct 15, 2018 at 03:56, Tex <textexin at xencraft.com> wrote:

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the RFC 3501 you pointed to, but I don’t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on where the encoding begins and ends, and on where
> higher-level protocols prescribe how it is embedded within the protocol.
>
>
>
> Tex
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 "=" signs; only 1 is
> enough to indicate the end of an octets-span. The extra "=" after it does not
> add any other octet. You are also allowed to insert whitespace anywhere in
> the encoded stream; this is what ensures that the Base64-encoded
> octets-stream will not be altered if line breaks are forced anywhere
> (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding: its
> "encoded" bit length is 0 and it does not terminate an octets-span, unlike
> "=", which discards any bits left over from the preceding encoded symbols
> that do not fill a complete octet.
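
The whitespace handling described above could be sketched as follows in Python. The function name and the IGNORABLE table are hypothetical, and the set of ignorable characters is simply the list given in the paragraph; a particular protocol may allow fewer.

```python
import base64

# Whitespace characters named above: SPACE, TAB, CR, LF, NEL (U+0085).
IGNORABLE = {ord(c): None for c in " \t\r\n\x85"}


def decode_ignoring_whitespace(text: str) -> bytes:
    """Drop the ignorable whitespace, then decode what is left strictly."""
    compact = text.translate(IGNORABLE)
    # validate=True makes any other non-alphabet character an error.
    return base64.b64decode(compact, validate=True)


# Line breaks forced into the stream do not change the payload.
wrapped = "bWVv\r\ndw==\r\n"
print(decode_ignoring_whitespace(wrapped) == base64.b64decode("bWVvdw=="))
```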
>
>
>
> Also:
>
> - For 1-octet pieces the minimum format is "XX=", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol).
>
> - For 2-octet pieces the minimum format is "XXX=", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol).
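
To make the discarded low bits concrete, here is a small hand-written lenient decoder for a single piece, as a sketch only. The name decode_piece is hypothetical, and the leniency is an assumption of the sketch: RFC 4648 section 3.5 asks encoders to set these pad bits to zero and lets decoders reject input where they are not.

```python
import string

# The standard Base64 alphabet, in index order 0..63.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def decode_piece(piece: str) -> bytes:
    """Leniently decode one "="-terminated piece, dropping leftover bits."""
    symbols = piece.rstrip("=")
    bits = "".join(format(ALPHABET.index(ch), "06b") for ch in symbols)
    # Bits beyond the last whole octet are the ones "=" discards.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


# "YQ" and "YR" differ only in the 4 lowest bits of the 2nd symbol.
print(decode_piece("YQ==") == decode_piece("YR="))
# "bWU" and "bWX" differ only in the 2 lowest bits of the 3rd symbol.
print(decode_piece("bWU=") == decode_piece("bWX="))
```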
>
>
>
> So you can use Base64 by encoding each octet as a separate piece, as two
> Base64 symbols followed by an "=" symbol, and even insert any amount of
> whitespace between the pieces: there is an infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets-stream but not directly any
> bits-stream: it assumes that the effective bits-stream has a bit length
> that is a multiple of 8. To encode a bits-stream with an exact number of bits
> (not a multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at the end of the encoded octets-stream (or
> at the start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 7 bits (so that the resulting bitstream has a length that is a
> multiple of 8, and is then encodable with Base64, which takes only octets on
> input);
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder: they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the bits-stream, just after the 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte boundaries when the effective bit length of the bits-stream
> payload is not a multiple of 8 and padding bits were added).
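
One way the "prepend a 3-bit padding count" option could look, as a sketch under stated assumptions: the bits-stream is modelled as a string of '0' and '1' characters, the filler bits are zeros rather than random bits, bits are packed most significant first, and the names encode_bits and decode_bits are hypothetical.

```python
import base64


def encode_bits(bits: str) -> str:
    """Frame a '0'/'1' string with a 3-bit pad count, then Base64-encode it."""
    # Number of filler bits (0 to 7) needed after the 3-bit header.
    pad = -(len(bits) + 3) % 8
    framed = format(pad, "03b") + bits + "0" * pad
    octets = bytes(int(framed[i:i + 8], 2) for i in range(0, len(framed), 8))
    return base64.b64encode(octets).decode("ascii")


def decode_bits(text: str) -> str:
    """Reverse encode_bits: read the pad count, drop the header and filler."""
    octets = base64.b64decode(text)
    framed = "".join(format(octet, "08b") for octet in octets)
    pad = int(framed[:3], 2)
    return framed[3:len(framed) - pad]


payload = "1011001110001"  # 13 bits, not a multiple of 8
print(decode_bits(encode_bits(payload)) == payload)
```

The Base64 layer itself only ever sees whole octets; the pad count and the filler are interpreted by the bits-stream layer built on top of it.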
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64 encoding;
> notably it does not specify their order and endianness. The same remark
> applies to MIME and HTTP. So a lot of network protocols and file
> formats need to specify how to encode which of the possible options is used
> to encode bits-streams of arbitrary length, or need to specify which default
> choice to apply if this option is not encoded, or which option must be used
> (with no possible variation). This also adds to the number of distinct
> encodings that are possible but are still equivalent for the same effective
> bits-stream payload.
>
>
>
> All these allowed variations are from the encoder perspective. For
> interoperability, the decoder has to be flexible and support the various
> options to be compatible with different implementations of the encoder,
> notably when the encoder was run on a different system. This is the
> case for MIME transport by mail, for HTTP and FTP transports, and for
> file/media storage formats, even if the file is stored on the same system
> (because it may actually be a local copy coming from another
> system where the file was actually encoded).
>
>
>
> Now if we come back to the encoding of plain-text payloads, Unicode just
> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
> points (it does not actually mandate an exact bit length, because the range
> does not exactly fill 21 bits and an encoder can still pack
> multiple code points together into more compact code units).
>
>
>
> However Unicode provides and standardizes several encodings (UTF-8/16/32)
> which use code units whose size is directly suitable as input for an
> octets-stream, so that they are directly encodable with Base64, without
> having to specify an extra layer for the bits-stream encoder/decoder.
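
As an illustration, the same text serialized with different UTFs gives different octets-streams and therefore different Base64 texts, each of which decodes back to the same text when the matching UTF is used. A short Python sketch; the helper name b64_of and the little-endian variants are choices made for the example.

```python
import base64


def b64_of(text: str, encoding: str) -> str:
    """Serialize text with the given UTF, then Base64-encode the octets."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


text = "meow"
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = b64_of(text, encoding)
    # Base64 restores the octets; the text comes back only via the same UTF.
    restored = base64.b64decode(encoded).decode(encoding)
    print(encoding, encoded, restored == text)
```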
>
>
>
> But many other encodings are still possible (and can be conforming to
> Unicode, provided they preserve each Unicode scalar value, or at least the
> code point identity because an encoder/decoder is not required to support
> non-character code points such as surrogates or U+FFFE), where Base64 may
> be used for internally generated octets-streams.
>
>
>
>
>
> On Sun, Oct 14, 2018 at 03:47, Adam Borowski via Unicode <
> unicode at unicode.org> wrote:
>
> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > On Sat, Oct 13, 2018 at 18:58, Steffen Nurpmeso via Unicode <
> > unicode at unicode.org> wrote:
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I have not spoken about the encoding of newlines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if the
> > whole text was an opaque binary object, their initial text-encoding will be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > into fragments with a maximum binary size so that these binary fragments can
> > be converted with Base64 into lines with a maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain fewer than those 3 bytes, in which case it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow"     is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
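
The "meow" example can be reproduced with a short Python sketch. The helper names are hypothetical, and the decoder assumes the text consists of whole 4-character packets with no whitespace; it decodes packet by packet because a library decoder given the whole string in one call may reject or truncate input that has padding in the middle, depending on the implementation.

```python
import base64


def encode_piecewise(*pieces: bytes) -> str:
    """Base64-encode each piece separately and concatenate the results."""
    return "".join(base64.b64encode(piece).decode("ascii") for piece in pieces)


def decode_concatenated(text: str) -> bytes:
    """Decode 4-character packets one by one; any packet may be padded."""
    packets = [text[i:i + 4] for i in range(0, len(text), 4)]
    return b"".join(base64.b64decode(packet) for packet in packets)


whole = encode_piecewise(b"meow")        # bWVvdw==
split = encode_piecewise(b"me", b"ow")   # bWU=b3c=
print(whole, split)
print(decode_concatenated(whole) == decode_concatenated(split))
```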
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ᛗᛖᛟᚹ
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
> ⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
> ⠈⠳⣄⠀⠀⠀⠀ and 1 who narrowly avoided an off-by-one error.
>
>