Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

Philippe Verdy via Unicode unicode at unicode.org
Fri Oct 12 20:12:40 CDT 2018


I think the reverse is also true!

Decoding a Base64 entity does not guarantee that it will return valid text
in any known encoding. So Unicode normalization of the output cannot apply.
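
As a small illustration in Python (the helper name is_valid_utf8 is made up
for this sketch, and UTF-8 is only one candidate encoding one might test):

```python
import base64

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "/w==" is well-formed Base64 for the single byte 0xFF,
# which can never appear in valid UTF-8.
raw = base64.b64decode("/w==")
print(raw, is_valid_utf8(raw))

# "aGk=" happens to decode to bytes that are valid UTF-8 ("hi"),
# but nothing in the Base64 layer itself says so.
other = base64.b64decode("aGk=")
print(other, is_valid_utf8(other))
```

The Base64 decoder succeeds in both cases; whether the result is text is a
separate question that Base64 cannot answer.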

Even if it represents text, nothing indicates that the result will be
encoded with some Unicode encoding form (unless this is tagged separately,
like in MIME).

If you use Base64 for decoding MIME contents (e.g. for emails), the Base64
decoding itself will not transform the encoding, but then the email parser
will have to ensure that the text encoding is valid, at which time it will
have to transform it (possibly replacing some invalid sequences or
truncating it), and only then may it apply normalization to help render
that text. But these transforms are part of the MIME application and are
independent of whether you used Base64 or any other binary encoding or
transport syntax.
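
A minimal Python sketch of that layering, assuming the content is labelled
as UTF-8 and that the parser chooses to replace invalid sequences with
U+FFFD and then normalize to NFC (both choices are assumptions made for
illustration, not something MIME mandates):

```python
import base64
import unicodedata

# Bytes for "e" + U+0301 (combining acute) followed by one invalid byte.
original = b"e\xcc\x81\xff"
payload = base64.b64encode(original)

# Step 1: Base64 decoding. Pure binary, no knowledge of text.
decoded = base64.b64decode(payload)

# Step 2: the application validates the declared charset and repairs it.
text = decoded.decode("utf-8", errors="replace")

# Step 3: only now can the application normalize for rendering.
normalized = unicodedata.normalize("NFC", text)

print(decoded == original)   # the Base64 layer changed nothing
print(ascii(text), ascii(normalized))
```

Steps 2 and 3 would be the same if the payload had arrived as
Quoted-Printable or as raw 8-bit data.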

In other words: "If m is not equal to m', then t will not equal t'" also
holds in the reverse direction, but nothing indicates that m or m' (the
Base64-decoded values) are texts: they are just opaque binary objects,
which are equal in value exactly when their Base64 encodings t and t' are.
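
To make this concrete, here is a short Python sketch using the NFC and NFD
forms of the same character as m and m' (the variable names are chosen for
this example only):

```python
import base64
import unicodedata

# Two canonically equivalent texts that differ as byte sequences.
m_nfc = unicodedata.normalize("NFC", "\u00e9").encode("utf-8")  # b'\xc3\xa9'
m_nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")  # b'e\xcc\x81'

t_nfc = base64.b64encode(m_nfc)
t_nfd = base64.b64encode(m_nfd)

# Different binary inputs give different Base64 outputs...
print(m_nfc != m_nfd, t_nfc != t_nfd)

# ...and decoding returns exactly the bytes that went in, unnormalized.
print(base64.b64decode(t_nfc) == m_nfc, base64.b64decode(t_nfd) == m_nfd)
```

Base64 treats the two byte sequences as unrelated objects, even though a
Unicode-aware comparison would call the two texts equivalent.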

Note: some Base64 envelope formats (like MIME) allow multiple
representations t and t' of the same message m, by adding padding or
transport syntaxes like line-splitting (with variable length). Base64 alone
does not allow that variation (it normally uses a static alphabet), but
there are variants that accept decoding extended alphabets as binary
equivalents. So you may have two MIME-encoded texts that have different
encodings (with Base64 or Quoted-Printable, with variable line lengths)
but that represent the same source binary object, and decoding these
different encoded messages will yield the same binary object: this does not
depend on Base64 but on the permissiveness/flexibility of decoders for these
envelope formats (using **extensions** of Base64 specific to the envelope
format).
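
A short Python sketch of the line-splitting case, using the standard
base64 module (the helper strict_accepts is a name invented here;
base64.encodebytes produces MIME-style wrapped output):

```python
import base64
import binascii

def strict_accepts(encoded: bytes) -> bool:
    """Return True if a strict decoder (no extra characters) accepts the input."""
    try:
        base64.b64decode(encoded, validate=True)
        return True
    except binascii.Error:
        return False

m = bytes(range(256))

t_plain = base64.b64encode(m)      # one unbroken line
t_wrapped = base64.encodebytes(m)  # MIME-style, newline-separated lines

# Two different encoded forms of the same binary object.
print(t_plain != t_wrapped)

# A permissive decoder skips the line breaks and recovers the same bytes.
print(base64.b64decode(t_plain) == base64.b64decode(t_wrapped) == m)

# A strict decoder only accepts the unwrapped form.
print(strict_accepts(t_plain), strict_accepts(t_wrapped))
```

Whether t and t' are both accepted is a property of the decoder and of the
envelope format, not of the Base64 alphabet itself.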


On Fri, Oct 12, 2018 at 18:27, Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>