Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

Philippe Verdy verdy_p at wanadoo.fr
Sun Jan 31 15:49:26 CST 2016


I also agree.

To transport binary data over a plain-text format there are common
encodings for this purpose, including Base64 and Quoted-Printable. You can
also compress the binary data before this transformation (using gzip or
deflate, for example in MIME for email), or compress it after this
transformation, only over the transport channel, as in HTTP, which
natively supports transparent 8-bit streams; the latter solution generally
performs better.
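As a rough illustration of the "compress, then armor" order described
above, here is a minimal Python sketch using only the standard library.
The helper names wrap() and unwrap() are made up for this example and are
not part of any standard.

```python
import base64
import gzip


def wrap(blob: bytes) -> str:
    """Compress the bytes, then Base64-encode them into pure ASCII text."""
    return base64.b64encode(gzip.compress(blob)).decode("ascii")


def unwrap(text: str) -> bytes:
    """Reverse of wrap(): Base64-decode, then decompress."""
    return gzip.decompress(base64.b64decode(text))


# Arbitrary bytes, including values that are not valid UTF-8 on their own.
payload = bytes([0xD8, 0x00, 0xFF, 0xFE, 0x00, 0x80]) * 100
armored = wrap(payload)
restored = unwrap(armored)
```

The armored string contains only ASCII letters, digits, "+", "/" and "=",
so it survives any conforming text conversion, and unwrap() gives back the
original bytes exactly.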

There's no reliable way to preserve the exact binary encoding of texts
containing invalid UTF sequences (including unpaired surrogates in UTF-16,
isolated surrogate code points and noncharacters in other UTFs, or
forbidden byte values and restricted byte sequences in UTF-8) without using
a binary envelope (which in turn cannot preserve the same encoding of valid
UTF sequences).
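A small Python sketch of why this fails in practice, assuming CPython's
built-in codecs as a stand-in for "a conforming converter": a strict
encoder rejects an unpaired surrogate, and a lenient decoder maps
different ill-formed byte sequences to the same replacement character, so
the original bytes cannot be recovered.

```python
lone = "\ud800"  # a str holding one unpaired high surrogate

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    # The strict UTF-8 encoder refuses surrogate code points.
    strict_ok = False

# Two different ill-formed byte sequences...
bad_a = b"\xff"
bad_b = b"\xfe"

# ...decode to the same text when errors are replaced with U+FFFD,
# so the distinction between them is lost.
text_a = bad_a.decode("utf-8", errors="replace")
text_b = bad_b.decode("utf-8", errors="replace")
collided = text_a == text_b
```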

Even by using another encoding scheme/encoding form or a legacy charset
mapped to Unicode (including the GB and HKSCS charsets), you will fail each
time due to the canonical equivalences and the existing conforming
conversions between all UTFs, which are made to preserve the identity of
characters, not the equality of their binary encodings.
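To make the canonical-equivalence problem concrete, here is a short Python
sketch using the standard unicodedata module. The two strings are
canonically equivalent, yet their code point counts and their UTF-8 bytes
differ, so any process that normalizes the text changes the underlying
binary data and its length.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Same abstract character, different code point sequences.
same_code_points = composed == decomposed

# A conforming process may normalize at any time.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# The binary encodings differ in content and in length.
bytes_composed = composed.encode("utf-8")
bytes_decomposed = decomposed.encode("utf-8")
```

Because the length changes, anything that depends on the position of each
code unit (such as a per-unit mask) goes out of step after the first
normalized character.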

In summary, what you need is:
- a transport syntax (see HTTP for example) to allow decoding your
envelope, and
- a separate media type (see HTTP and MIME for example; don't choose one
in "text/*", but one in "application/*", such as
"application/octet-stream") or some filesystem convention or standard for
file types (such as file name extensions in common Unix/Linux filesystems
or FTP, or external metadata streams for file attributes such as in Mac OS,
VMS, or even NTFS and almost all HTTP-based filesystems) for your chosen
binary encoding encapsulated in a text-compatible format.
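The two requirements above can be sketched with Python's standard email
package, which is only one possible way to do this. The message declares a
non-text media type and lets the library pick a text-safe transfer
encoding for the payload. The subject string and the sample bytes are made
up for the example.

```python
import email
from email import policy
from email.message import EmailMessage

blob = bytes([0x00, 0xD8, 0xFF, 0xFE])  # not valid text in any UTF

msg = EmailMessage()
msg["Subject"] = "opaque payload"
# Declare the payload as binary data, not as text/*.
msg.set_content(blob, maintype="application", subtype="octet-stream")

# Serialize to the wire format, then parse it back as a receiver would.
wire = msg.as_bytes()
received = email.message_from_bytes(wire, policy=policy.default)
recovered = received.get_content()
```

The receiver reads the declared media type and transfer encoding from the
headers, so it knows to treat the body as an opaque blob and never runs it
through text normalization or charset conversion.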

If your encoded document does not exactly match the strict conformance
requirements of a text encoding, it cannot be declared or handled as if it
were valid text. You have to handle it as an opaque BLOB (as if it were
data for a bitmap image, executable code, a PKI encryption key, a digest
such as SHA, or an encrypted stream such as DES).

Basic filesystems for Unix/Linux or FAT treat all their files as
unrestricted blobs. That's why they use separate data to represent a
file's actual type and decode it with specific algorithms: most commonly a
filename extension determines the envelope format, and internal data
structures in this envelope then apply, such as MPEG, OGG, XML with schema
validation, or ZIP archives embedding multiple structured streams with some
conventions.

All these options are out of scope of the Unicode standard, which is not
made to transport and preserve binary encodings, but is made purposely
to allow transparent conversions between all conforming UTFs of valid text
only (nothing else) and to support canonical equivalences as much as
possible in "Unicode-conforming processes", so that these processes can
choose between these well-known and standardized text representations.
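A brief Python sketch of what "transparent conversion between UTFs" means
for valid text, using the built-in codecs: the code point sequence is
identical after every conversion, while the byte sequences differ from one
UTF to the next.

```python
# ASCII letter, a Latin-1 range letter, and a supplementary character.
text = "A\u00e9\U0001F600"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# Converting UTF-8 to UTF-16 goes through the code points, not the bytes.
converted = utf8.decode("utf-8").encode("utf-16-le")

# All three forms decode back to the same text.
round_trips = (
    utf8.decode("utf-8") == text
    and utf16.decode("utf-16-le") == text
    and utf32.decode("utf-32-le") == text
)
```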

2016-01-31 20:52 GMT+01:00 Shawn Steele <Shawn.Steele at microsoft.com>:

> It should be understood that any algorithm that changes the Unicode
> character data to non-character data is therefore binary, and not Unicode.
> It's inappropriate to shove binary data into unicode streams because stuff
> will break.
>
> https://blogs.msdn.microsoft.com/shawnste/2005/09/26/avoid-treating-binary-data-as-a-string/
>
>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris
> Jacobs
> Sent: Sunday, January 31, 2016 10:08 AM
> To: J Decker <d3ck0r at gmail.com>
> Cc: unicode at unicode.org
> Subject: Re: Encoding/Use of pontial unpaired UTF-16 surrogate pair
> specifiers
>
>
>
> J Decker schreef op 2016-01-31 18:56:
> > On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl>
> > wrote:
> >>
> >>
> >> J Decker schreef op 2016-01-31 03:28:
> >>>
> >>> I've reconsidered and think for ease of implementation to just mask
> >>> every UTF-16 character (not  codepoint) with a 10 bit value, This
> >>> will result in no character changing from BMP space to
> >>> surrogate-pair or vice-versa.
> >>>
> >>> Thanks for the feedback.
> >>
> >>
> >> So you are still trying to handle the unarmed output as plaintext.
> >> Do you realize that if a string in the output is replaced by a
> >> canonical equivalent one this may mess up things because the
> >> originals are not canonical equivalent?
> >>
> > I see ... things like mentioned here
> > http://websec.github.io/unicode-security-guide/character-transformations/
>
> Yes especially the part about normalization.
> This would not only spoil the normalized string, but also, as the string
> can have a different length, for anything after that your ever-changing
> xor-values may go out of sync.
>
>
>
>