Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

David Starner prosfilaes at gmail.com
Sat Jan 30 21:20:14 CST 2016


Obfuscate is right. It might conceivably be better than nothing, but at its
best it will stop someone for an hour or so. Why not run it through a
standard encryption protocol and if necessary use one of the options
mentioned before to turn it into valid text?
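
To make that concrete, here is a minimal sketch, assuming Node.js and its built-in crypto module (in a browser the Web Crypto API would play the same role). AES-256-GCM and base64 are my own picks for "a standard encryption protocol" and "valid text"; the function names are hypothetical and key storage is left out entirely.

```javascript
// Sketch only: key management is deliberately omitted.
const crypto = require("crypto");

// Encrypt a string and return plain ASCII (base64) text that any
// UTF-8 or UTF-16 store can hold without conversion problems.
function encryptToText(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce per message
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte authentication tag
  return Buffer.concat([iv, tag, body]).toString("base64");
}

// Reverse of encryptToText: split nonce, tag and ciphertext, then decrypt.
function decryptFromText(text, key) {
  const raw = Buffer.from(text, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const body = raw.subarray(28);
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(body), decipher.final()]).toString("utf8");
}
```

One caveat with this sketch: Node's "utf8" encoding replaces any unpaired surrogate in the input with U+FFFD, so an ill-formed string would not round-trip exactly.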

On Sat, Jan 30, 2016, 6:31 PM J Decker <d3ck0r at gmail.com> wrote:

> On Sat, Jan 30, 2016 at 4:45 PM, Shawn Steele
> <Shawn.Steele at microsoft.com> wrote:
> > Why do you need illegal unicode code points?
>
> This originated from learning JavaScript, which is internally UTF-16.
> Playing with localStorage, I found that some browsers use a sqlite3
> database to store values.  The database is UTF-8, so there must be a
> valid conversion between the internal UTF-16 and the UTF-8 localStorage
> (and the reverse).  I wanted to obfuscate the data stored for a certain
> application, and to cover all content that someone might send.  Having
> slept on this, I realized that even if hieroglyphics were stored, if I
> pulled out the character using codePointAt() and applied a 20 bit
> random value to it using XOR, it could end up as a normal character,
> and I wouldn't know I had to use a 20 bit value... so every character
> would have to use a 20 bit mask (which could end up with a value
> that's in D800-DFFF).
>
> I've reconsidered, and for ease of implementation I think I'll just mask
> every UTF-16 code unit (not code point) with a 10 bit value.  This will
> result in no character changing from BMP space to a surrogate pair or
> vice versa.
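
For reference, the 10 bit idea can be sketched as below. This is only an illustration with hypothetical names (maskCodeUnits, unitClass), not the poster's code. It works because the high six bits of a code unit decide whether it is a high surrogate, a low surrogate, or neither, and a 10 bit XOR never touches those bits.

```javascript
// XOR every UTF-16 code unit with a mask limited to the low 10 bits.
// `masks` is an array of integers that is cycled over the input.
function maskCodeUnits(text, masks) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const mask = masks[i % masks.length] & 0x3ff; // keep only 10 bits
    out += String.fromCharCode(text.charCodeAt(i) ^ mask);
  }
  return out;
}

// Classify a code unit: "high" (D800-DBFF), "low" (DC00-DFFF) or "bmp".
function unitClass(unit) {
  if (unit >= 0xd800 && unit <= 0xdbff) return "high";
  if (unit >= 0xdc00 && unit <= 0xdfff) return "low";
  return "bmp";
}
```

Applying maskCodeUnits twice with the same masks restores the original string. The output stays well-formed UTF-16 whenever the input was, but it can still contain control characters or noncharacters, and a repeating XOR mask is obfuscation at best.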
>
> Thanks for the feedback.
> (sorry if I've used some terms inaccurately)
>
> >
> > -----Original Message-----
> > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker
> > Sent: Saturday, January 30, 2016 6:40 AM
> > To: unicode at unicode.org
> > Subject: Encoding/Use of potential unpaired UTF-16 surrogate pair
> > specifiers
> >
> > I do see that the code points D800-DFFF should not be encoded in any UTF
> > format (UTF8/32)...
> >
> > UTF8 has a way to define any byte that might otherwise be used as an
> > encoding byte.
> >
> > UTF16 has no way to define a code point that is D800-DFFF; this is an
> > issue if I want to apply some sort of encryption algorithm and still have
> > the result treated as text for transmission and encoding to other string
> > systems.
> >
> > http://www.azillionmonkeys.com/qed/unicode.html lists the Unicode
> > private use areas: Area-A, which is U-F0000:U-FFFFD, and Area-B, which is
> > U-100000:U-10FFFD. These will suffice as a workaround for my purposes...
> >
> > For my purposes I will implement F0000-F0800 to be (code point minus
> > D800 and then add F0000 (or vice versa)) and then encoded as a surrogate
> > pair... it would have been super nice if the Unicode standard included a
> > way to specify a code point even if there isn't a language character
> > assigned to that point.
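
The private-use workaround described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the target range is taken as U+F0000 through U+F07FF inclusive, since D800-DFFF covers 0x800 values.

```javascript
const SURROGATE_START = 0xd800;
const SURROGATE_END = 0xdfff;
const PUA_BASE = 0xf0000; // start of Supplementary Private Use Area-A

// Replace each unpaired surrogate with a private-use code point
// (code unit minus D800 plus F0000); valid pairs are left untouched.
function escapeLoneSurrogates(text) {
  let out = "";
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i); // a lone surrogate is returned as-is
    if (cp >= SURROGATE_START && cp <= SURROGATE_END) {
      out += String.fromCodePoint(cp - SURROGATE_START + PUA_BASE);
      i += 1;
    } else {
      out += String.fromCodePoint(cp);
      i += cp > 0xffff ? 2 : 1;
    }
  }
  return out;
}

// Reverse mapping: private-use code points back to lone surrogates.
function unescapeLoneSurrogates(text) {
  let out = "";
  for (const ch of text) { // string iteration goes by code point
    const cp = ch.codePointAt(0);
    if (cp >= PUA_BASE && cp <= PUA_BASE + 0x7ff) {
      out += String.fromCharCode(cp - PUA_BASE + SURROGATE_START);
    } else {
      out += ch;
    }
  }
  return out;
}
```

The escaped string is well-formed UTF-16, so it survives conversion to UTF-8 and back. The mapping is only reversible if the original text never uses U+F0000 through U+F07FF itself, which is the usual risk of borrowing private-use code points.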
> >
> > http://unicode.org/faq/utf_bom.html
> > does say: "Q: Are there any 16-bit values that are invalid?
> >
> > A: Unpaired surrogates are invalid in UTFs. These include any value in
> > the range D800 to DBFF not followed by a value in the range DC00 to DFFF,
> > or any value in the range DC00 to DFFF not preceded by a value in the range
> > D800 to DBFF "
> >
> > and "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
> >
> > A different issue arises if an unpaired surrogate is encountered when
> > converting ill-formed UTF-16 data. By representing such an unpaired
> > surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
> > would become ill-formed. While it faithfully reflects the nature of the
> > input, Unicode conformance requires that encoding form conversion always
> > results in valid data stream. Therefore a converter must treat this as an
> > error. "
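
The two FAQ answers quoted above translate fairly directly into a check. The following is a minimal sketch with hypothetical function names. It raises an error for ill-formed input as one way of "treating this as an error"; JavaScript's own TextEncoder substitutes U+FFFD instead, which is why the check runs first.

```javascript
// Return the index of the first unpaired surrogate, or -1 if the
// string is well-formed UTF-16.
function firstUnpairedSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    // A low surrogate reached here was not preceded by a high one.
    if (unit >= 0xdc00 && unit <= 0xdfff) return i;
  }
  return -1;
}

// Strict conversion: refuse ill-formed input rather than emit bad UTF-8.
function toUtf8Strict(text) {
  const bad = firstUnpairedSurrogate(text);
  if (bad !== -1) throw new RangeError("unpaired surrogate at index " + bad);
  return new TextEncoder().encode(text);
}
```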
> >
> >
> >
> > I did see these older messages... (not that they talk about this much,
> > just more info)
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0204.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0209.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0210.html
> > http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0201.html
>
>