Counting Codepoints

Sun Oct 11 23:36:49 CDT 2015

On 10/11/2015 2:20 PM, Richard Wordingham wrote:
> Is the number of codepoints in a UTF-16 string well defined?
>
> For example, which of the following two statements are true?
>
> (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
> 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020.
>
> (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
> 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20.
>
> Statement (a) is probably more useful, but I couldn't find anything to
> rule that statement (b) is false.

I think the correct answer is probably:

(c) The ill-formed three code-unit Unicode 16-bit string
<0xDC00, 0xD800, 0xDC20> contains one code point, U+10020, and
one uninterpreted (and uninterpretable) low surrogate
code unit 0xDC00.

In other words, I don't think it is useful or helpful to map isolated,
uninterpretable surrogate code units *to* surrogate code points.
Surrogate code points are an artifact of the code architecture. They
are code points in the code space which *cannot* be represented
in UTF-16, by definition.

Any discussion about properties for surrogate code points is a
matter of designing graceful API fallback for instances which
have to deal with ill-formed strings and do *something*. I don't
think that should extend to treating isolated surrogate code
units as having interpretable status, *as if* they were valid
code points represented in the string.
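
To make the counting in (c) concrete, here is a minimal Python sketch.
The helper name count_utf16 and its return shape are my own illustration,
not anything defined by the standard: it pairs each high surrogate with an
immediately following low surrogate, and reports every other surrogate
code unit as uninterpretable rather than promoting it to a code point.

```python
def count_utf16(units):
    """Split 16-bit code units into (code_points, uninterpretable_units).

    Hypothetical helper for illustration only.
    """
    code_points = []
    stray = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed surrogate pair: one supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate code unit: counted, but not as a code point.
            stray.append(u)
            i += 1
        else:
            # Any other code unit is a BMP code point by itself.
            code_points.append(u)
            i += 1
    return code_points, stray


cps, stray = count_utf16([0xDC00, 0xD800, 0xDC20])
print([f"U+{cp:04X}" for cp in cps], [f"0x{u:04X}" for u in stray])
```

For the string in question this reports one code point and one
uninterpretable code unit.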

It might be easier to get a handle on this if folks were to ask, instead,
how many code points are in the ill-formed Unicode 8-bit
string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. Six code units, yes,
but how many code points? I'd say two code points (U+0061 twice) and
four uninterpretable, ill-formed UTF-8 code units, rather than
any other possible answer.

Basically, you get the same kind of answer if the ill-formed string
were, instead, <0x61, 0xED, 0xA0, 0x80, 0x61>: two code points
and three uninterpretable, ill-formed UTF-8 code units. That is a
better answer than trying to map 0xED 0xA0 0x80 to U+D800
and then saying, oh, that is a surrogate code *point*.
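
The same counting can be sketched for the two 8-bit strings above. This is
only an illustration: count_utf8 is a hypothetical helper, and the choice
to set aside a failed lead byte alone and then re-examine the bytes after
it is my assumption, not a rule from the standard. The second-byte ranges
are the ones that well-formed UTF-8 requires (for example 0x80..0x9F after
0xED, and 0x80..0x8F after 0xF4).

```python
def count_utf8(data):
    """Split UTF-8 code units into (code_points, uninterpretable_units).

    Hypothetical helper for illustration only.
    """
    code_points, stray = [], []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            code_points.append(b)  # ASCII: one code unit, one code point
            i += 1
            continue
        # need = number of trailing bytes; lo..hi = allowed second byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes U+D800..U+DFFF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
        else:
            stray.append(b)  # lone trailing byte, or a byte never valid
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        ok = (len(tail) == need and lo <= tail[0] <= hi
              and all(0x80 <= t <= 0xBF for t in tail[1:]))
        if ok:
            cp = b & (0x7F >> (need + 1))  # payload bits of the lead byte
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            code_points.append(cp)
            i += 1 + need
        else:
            stray.append(b)  # set the lead byte aside, rescan what follows
            i += 1
    return code_points, stray


for s in (bytes([0x61, 0xF4, 0x90, 0x90, 0x90, 0x61]),
          bytes([0x61, 0xED, 0xA0, 0x80, 0x61])):
    cps, bad = count_utf8(s)
    print(len(cps), "code points,", len(bad), "uninterpretable code units")
```

Neither string yields U+D800 or any other surrogate code point; the
offending bytes are simply counted as code units that could not be
interpreted.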

--Ken