Surrogates and noncharacters

Sat May 9 09:26:34 CDT 2015

On Sat, 9 May 2015 15:11:51 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Except that you are explaining something else. You are speaking about
> "Unicode strings" which are bound to a given UTF, I was speaking ONLY
> about "16-bit strings" which were NOT bound to Unicode (and did not
> have to). So TUS is completely not relevant here. I have NOT written
> "Unicode 16-bit strings", only "16-bit strings" and I clearly opposed
> the two DISTINCT concepts in the SAME sentence so that no confusion
> was possible.

The long sentence of yours I am responding to is:

"And here you're wrong: a 16-bit string is just a sequence of arbitrary
16-bit code units, but an Unicode string (whatever the size of its code
units) adds restrictions for validity (the only restriction being in
fact that surrogates (when present in 16-bit strings, i.e. UTF-16) must
be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates
are forbidden."

The point I made is that every string of 16-bit values is (valid
as) a Unicode string.  Do you accept that?  If not, please exhibit a
counter-example.

In particular, I claim that all 6 permutations of <D800, 0054, DCC1>
are Unicode strings, but that only two, namely <D800, DCC1, 0054> and
<0054, D800, DCC1>, are UTF-16 strings.
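
For anyone who wants to check the count mechanically, here is a minimal
sketch in Python. It assumes that "UTF-16 string" means a sequence of
16-bit code units in which every surrogate is part of a high+low pair.
The helper name is_well_formed_utf16 is my own invention for
illustration and does not come from any library.

```python
from itertools import permutations


def is_well_formed_utf16(units):
    """Return True if every surrogate in `units` is part of a high+low pair.

    Hypothetical helper for illustration; `units` is a sequence of
    integers in the range 0x0000..0xFFFF.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here has no high surrogate before it.
            return False
        i += 1
    return True


UNITS = (0xD800, 0x0054, 0xDCC1)

# Every permutation is a sequence of 16-bit code units; keep only those
# in which the surrogates are correctly paired.
well_formed = [p for p in permutations(UNITS) if is_well_formed_utf16(p)]

for p in well_formed:
    print("<" + ", ".join(f"{u:04X}" for u in p) + ">")
```

Under that assumption, the check accepts only the two orderings in which
D800 is immediately followed by DCC1.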

Richard.