Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers
Doug Ewell
doug at ewellic.org
Sat Jan 30 15:05:52 CST 2016
J Decker wrote:
> UTF16 has no way to define a code point that is D800-DFFF; this is an
> issue if I want to apply some sort of encryption algorithm and still
> have the result treated as text for transmission and encoding to other
> string systems.
Unpaired surrogates are not valid Unicode text. If you want to encrypt
data into 16-bit code units and have them treated as valid Unicode text,
the encryption algorithm must not generate unpaired surrogates.
This is not negotiable and not something you can be "partially"
compliant on. See Unicode Conformance Requirement C1: "A process shall
not interpret a high-surrogate code point or a low-surrogate code point
as an abstract character."
There's a reason this is "C1" and not farther down the list. It is
fundamental to Unicode.
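
To make the requirement concrete, here is a minimal sketch of the check an encryption step would need to pass before its 16-bit output can be called Unicode text. The function name find_unpaired_surrogates is my own, and it assumes the code units arrive as a plain sequence of integers; it is an illustration, not a reference implementation.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a sequence of 16-bit code units."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows immediately.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as the second half of a pair.
            bad.append(i)
        i += 1
    return bad
```

An empty result means the sequence is well-formed UTF-16; anything else means the output cannot be handed to another system as text.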
> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a
> surrogate pair...
This is fine for a private implementation where you are sure no input
will contain these PUA code points. Keep in mind that some people do use
them -- for example, they are assigned in the ConScript Unicode
Registry, which is unofficial and not affiliated with Unicode.
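
A rough sketch of the mapping as I understand it from the description above, with names (escape, unescape, PUA_BASE) that are mine and not from the original message. It assumes the input is a list of integer code points. Note that D800..DFFF is 0x800 values, so the targets run from F0000 through F07FF.

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
PUA_BASE = 0xF0000  # start of Supplementary Private Use Area-A
PUA_LAST = PUA_BASE + (SURROGATE_LAST - SURROGATE_FIRST)  # 0xF07FF


def escape(code_points):
    """Move surrogate code points into the PUA so the result is valid text."""
    out = []
    for cp in code_points:
        if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
            cp = cp - SURROGATE_FIRST + PUA_BASE
        out.append(chr(cp))
    return "".join(out)


def unescape(text):
    """Reverse escape(); any PUA code point in F0000..F07FF becomes a surrogate."""
    out = []
    for ch in text:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_LAST:
            cp = cp - PUA_BASE + SURROGATE_FIRST
        out.append(cp)
    return out
```

The escaped string encodes cleanly as UTF-16 (each mapped value becomes an ordinary surrogate pair), but the round trip is lossy for input that already contains U+F0000..U+F07FF: unescape(escape([0xF0000])) returns [0xD800], not [0xF0000]. That is the collision to watch for.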
> it would have been super nice of unicode standards
> included a way to specify code point even if there isn't a language
> character assigned to that point.
It's not a question of whether a code point is assigned to a "language
character." There are hundreds of thousands of unassigned code points
that can be represented in any UTF, such as U+77777. But
unpaired surrogates can *never* be assigned to a character. If they
could, they would have failed in their basic purpose of extending
UTF-16.
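
The difference is easy to see with a few lines of Python, whose strict codecs refuse lone surrogates. This is only an illustration; the helper names encode_all and is_encodable are made up for the example.

```python
UTFS = ("utf-8", "utf-16-be", "utf-32-be")


def encode_all(cp):
    """Encode one code point in each UTF and return the bytes as hex strings."""
    return {name: chr(cp).encode(name).hex() for name in UTFS}


def is_encodable(cp):
    """True if the code point can be encoded by the strict UTF codecs."""
    try:
        encode_all(cp)
        return True
    except UnicodeEncodeError:
        # Raised for D800..DFFF: "surrogates not allowed".
        return False
```

encode_all(0x77777) gives f1b79db7 in UTF-8, d99ddf77 in UTF-16 (a proper surrogate pair), and 00077777 in UTF-32, even though U+77777 is unassigned. is_encodable(0xD800) is False.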
--
Doug Ewell | http://ewellic.org | Thornton, CO