Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers

Doug Ewell doug at ewellic.org
Sat Jan 30 15:05:52 CST 2016


J Decker wrote:

> UTF16 has no way to define a code point that is D800-DFFF; this is an
> issue if I want to apply some sort of encryption algorithm and still
> have the result treated as text for transmission and encoding to other
> string systems.

Unpaired surrogates are not valid Unicode text. If you want to encrypt 
data into 16-bit code units and have them treated as valid Unicode text, 
the encryption algorithm must not generate unpaired surrogates.
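
As a rough illustration of what "must not generate unpaired surrogates" 
means in practice, here is a minimal Python sketch that scans a sequence 
of 16-bit code units and reports any surrogate that lacks a partner. The 
function name find_unpaired_surrogates is made up for this example, and 
the input is assumed to be a list of integers in the range 0..0xFFFF.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that have no partner."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the complete pair
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad
```

An encryption step whose output makes this function return a non-empty 
list has not produced valid UTF-16.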

This is not negotiable and not something you can be "partially" 
compliant on. See Unicode Conformance Requirement C1: "A process shall 
not interpret a high-surrogate code point or a low-surrogate code point 
as an abstract character."

There's a reason this is "C1" and not farther down the list. It is 
fundamental to Unicode.

> For my purposes I will implement F0000-F0800 to be (code point minus
> D800 and then add F0000 (or vice versa)) and then encoded as a
> surrogate pair...

This is fine for a private implementation where you are sure no input 
will contain these PUA code points. Keep in mind that some people do use 
them -- for example, they are assigned in the ConScript Unicode 
Registry, which is unofficial and not affiliated with Unicode.
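
The mapping described in the quoted text could look something like the 
Python sketch below. The names PUA_BASE, escape_unit, unescape_code_point 
and to_utf16_pair are hypothetical, chosen only for this example. Note 
that D800..DFFF covers 0x800 values, so the targets actually used are 
F0000..F07FF. The sketch assumes, as above, that the input never contains 
those PUA code points for any other purpose.

```python
PUA_BASE = 0xF0000  # first code point of Supplementary Private Use Area-A


def escape_unit(unit):
    """Map a surrogate code unit (D800..DFFF) to F0000..F07FF."""
    if not 0xD800 <= unit <= 0xDFFF:
        raise ValueError("not a surrogate code unit")
    return PUA_BASE + (unit - 0xD800)


def unescape_code_point(cp):
    """Inverse of escape_unit: map F0000..F07FF back to D800..DFFF."""
    if not PUA_BASE <= cp <= PUA_BASE + 0x7FF:
        raise ValueError("not an escaped surrogate")
    return 0xD800 + (cp - PUA_BASE)


def to_utf16_pair(cp):
    """Encode a supplementary code point as a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)
```

Every escaped value is then carried as a proper surrogate pair, so the 
resulting text stays well-formed UTF-16.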

> it would have been super nice if unicode standards
> included a way to specify code point even if there isn't a language
> character assigned to that point.

It's not a question of whether a code point is assigned to a "language 
character." There are hundreds of thousands of unassigned code points 
that can be represented in any UTF, such as U+77777. But 
unpaired surrogates can *never* be assigned to a character. If they 
could, they would have failed in their basic purpose of extending 
UTF-16.
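
The difference is easy to see with a short Python sketch, which relies 
only on the built-in codecs (Python 3 assumed). The helper name 
encodable_in_all_utfs is made up for this example.

```python
def encodable_in_all_utfs(code_point):
    """True if the code point can be serialized as UTF-8, UTF-16 and UTF-32."""
    try:
        for codec in ("utf-8", "utf-16-be", "utf-32-be"):
            chr(code_point).encode(codec)
    except UnicodeEncodeError:
        # Python's strict codecs refuse to encode surrogate code points.
        return False
    return True


# U+77777 is unassigned, but it is an ordinary Unicode scalar value.
unassigned = chr(0x77777)
print(unassigned.encode("utf-8").hex())      # f1b79db7
print(unassigned.encode("utf-16-be").hex())  # d99ddf77
print(unassigned.encode("utf-32-be").hex())  # 00077777

print(encodable_in_all_utfs(0x77777))  # True
print(encodable_in_all_utfs(0xD800))   # False
```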

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸


