Counting Codepoints

Mon Oct 12 14:38:18 CDT 2015

On Sun, 11 Oct 2015 21:36:49 -0700
Ken Whistler <kenwhistler at att.net> wrote:

> I think the correct answer is probably:
> 
> (c) The ill-formed three code unit Unicode 16-bit string
> <0xDC00, 0xD800, 0xDC20> contains one code point, U+10020 and
> one uninterpreted (and uninterpretable) low surrogate
> code unit 0xDC00.
> 
> In other words, I don't think it is useful or helpful to map isolated,
> uninterpretable surrogate code units *to* surrogate code points.
> Surrogate code points are an artifact of the code architecture. They
> are code points in the code space which *cannot* be represented
> in UTF-16, by definition.
> 
> Any discussion about properties for surrogate code points is a
> matter of designing graceful API fallback for instances which
> have to deal with ill-formed strings and do *something*. I don't
> think that should extend to treating isolated surrogate code
> units as having interpretable status, *as if* they were valid
> code points represented in the string.

Graceful fallback is exactly where the issue arises.  Throwing an
exception is not a useful answer to the question of how many code
points a 'Unicode string' (not a 'UTF-16 string') contains.  The
question can arise when one is following an instruction to advance x
codepoints; the usual presumption is that the preferred response is to
advance exactly x scalar values and not advance over anything else.
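
To make the two ways of counting concrete, here is a minimal Python
sketch.  The function name and the pair it returns are my own choices
for illustration, not anything defined by the standard.  It reports
well-formed code points and unpaired surrogate code units separately,
so a caller can use either Ken's count or the 'advance over everything'
count.

```python
def count_utf16(units):
    """Return (code_points, lone_surrogates) for 16-bit code units.

    Hypothetical helper: a surrogate pair or a non-surrogate unit counts
    as one code point; an unpaired surrogate is counted separately.
    """
    code_points = 0
    lone = 0
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            code_points += 1      # well-formed surrogate pair
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            lone += 1             # unpaired surrogate code unit
            i += 1
        else:
            code_points += 1      # BMP scalar value
            i += 1
    return code_points, lone


cps, lone = count_utf16([0xDC00, 0xD800, 0xDC20])
```

For <0xDC00, 0xD800, 0xDC20> this gives one code point and one unpaired
surrogate; whether 'advance 2' may step over both is the open question.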

> It might be easier to get a handle on this if folks were to ask,
> instead how many code points are in the ill-formed Unicode 8-bit
> string <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61>. 6 code units, yes,
> but how many code points? I'd say two code points and
> 4 uninterpretable, ill-formed UTF-8 code units, rather than
> any other possible answer.

In this case I'd say three 'somethings', and define
'something' accordingly.  There are different ideas as to what a
'something' should be.  Having a clear definition matters when
moving backwards and forwards through a Unicode 8-bit string.

> Basically, you get the same kind of answer if the ill-formed string
> were, instead <0x61, 0xED, 0xA0, 0x80, 0x61>. Two code points
> and 3 uninterpretable, ill-formed UTF-8 code units. That is a
> better answer than trying to map 0xED 0xA0 0x80 to U+D800
> and then saying, oh, that is a surrogate code *point*.

A simple scenario is a filter that takes in a single byte (or EOF) at a
time and returns a scalar value, 'no character yet', 'corrupt' or 'end
of text'.  It is a significant complication for it to have to emit
sequences of values indicating uninterpretable bytes.

I've found it much easier to treat a sequence of UTF-8 code units that
is bad by reason of its length and indicated scalar value as a single
entity.  This simplifies moving forwards and backwards through strings
to just detecting non-continuation bytes and limiting traversal through
runs of continuation bytes.  Otherwise, one must also check the
following continuation byte for a valid range.  For example, if one
starts at position 5 (counting from 0) in your first example, just
before the second 0x61, one faces the following logic when moving back
one codepoint.

1) Provisionally back up to position 1, just before 0xF4.
2) Confirm that one has skipped no more than 3 continuation bytes.
3) Confirm that at least 3 continuation bytes follow the 0xF4.
4) Examine the first continuation byte, 0x90, and realise that it is not
a legal value there.
5) Change to moving back one byte, arriving at position 4, just before
the last 0x90.
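
The five steps above, next to the simpler rule they replace, might be
sketched in Python as follows.  This is only an illustration: the
function names are made up, pos is assumed to be a byte offset that is
already on a boundary, and the 'simple' version counts a truncated
sequence as one entity as well.

```python
def declared_length(b):
    """Sequence length announced by byte b; 0 if b cannot start one."""
    if b < 0x80:
        return 1
    if 0xC0 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF4:
        return 4
    return 0  # continuation byte, or F5..FF


def back_up(buf, pos):
    """Steps 1 and 2: skip back over at most 3 continuation bytes."""
    if pos <= 0:
        raise ValueError("already at the start")
    p, skipped = pos - 1, 0
    while p > 0 and skipped < 3 and 0x80 <= buf[p] <= 0xBF:
        p -= 1
        skipped += 1
    return p, skipped


def step_back_simple(buf, pos):
    """Bad-but-complete (or truncated) sequences are one entity."""
    p, skipped = back_up(buf, pos)
    if skipped and declared_length(buf[p]) >= skipped + 1:
        return p
    return pos - 1


def first_continuation_ok(lead, c):
    """Step 4: the range check on the byte after the lead byte."""
    if lead == 0xE0:
        return 0xA0 <= c <= 0xBF
    if lead == 0xED:
        return 0x80 <= c <= 0x9F
    if lead == 0xF0:
        return 0x90 <= c <= 0xBF
    if lead == 0xF4:
        return 0x80 <= c <= 0x8F
    return 0x80 <= c <= 0xBF


def step_back_strict(buf, pos):
    """Only well-formed sequences are one entity; else back one byte."""
    p, skipped = back_up(buf, pos)
    lead = buf[p]
    if (skipped
            and lead not in (0xC0, 0xC1)
            and declared_length(lead) == skipped + 1       # step 3
            and first_continuation_ok(lead, buf[p + 1])):  # step 4
        return p
    return pos - 1                                         # step 5


example = bytes([0x61, 0xF4, 0x90, 0x90, 0x90, 0x61])
simple_result = step_back_simple(example, 5)
strict_result = step_back_strict(example, 5)
```

From position 5, the simple version stops at position 1 and the strict
version at position 4, as in step 5.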

It gets even more complicated if one follows the "maximal subpart"
approach of TUS Ch. 3.

By contrast, one can even report the bad sequences in a 21-bit
extension of Unicode.  For example, one could use bits 20:16 to encode
the problem, e.g.:

0-16 => Valid scalar value (excludes 0xD800 to 0xDFFF)

1) Numbers that look like scalar values:

1.1) Value not a scalar value:

17 => 11xxxx (start F4 9y)
18 => 12xxxx (start F4 Ay)
19 => 13xxxx (start F4 By)
20 => Surrogate codepoint (start ED Ay or ED By) (2^11 seqq.)

1.2) Non-shortest form:
21 => 4 bytes long (start F0 8y) (image of BMP)
22 => 3 bytes long (start E0 8y or E0 9y) (2^11 seqq.)
23 => 2 bytes long (start C0 or C1) (image of ASCII)*

2) Uninterpretable sequences:
24 => Declared length 4 but actually 3 long (5 * 2^12 seqq.)
25 => Declared length 4 but actually 2 long (5 * 2^6 seqq.)
26 => Declared length 3 but actually 2 long (2^10 seqq.)
27 => Non-ASCII lone bytes (2^7 seqq.)*

* Not necessarily composed of UTF-8 code units.

In this scheme, <0x61, 0xF4, 0x90, 0x90, 0x90, 0x61> would be analysed
as <U+0061, V+110410, U+0061>, and the application could decide what to
do with V+110410.  It'd probably just be replaced by U+FFFD.
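
As a check on that arithmetic, here is a small Python sketch (the name
raw_value is hypothetical).  It only covers rows 17 to 19, where the
extended value is simply the number the bytes spell out; the other rows
need a remapping into their own ranges, which I have not spelt out
above and which the sketch does not attempt.

```python
def raw_value(seq):
    """Number spelt out by a lead byte plus its continuation bytes.

    Assumes seq is 2 to 4 bytes long, starts with a lead byte that
    declares len(seq), and is followed only by continuation bytes.
    No validity checking is done.
    """
    n = len(seq)
    value = seq[0] & (0xFF >> (n + 1))   # payload bits of the lead byte
    for c in seq[1:]:
        value = (value << 6) | (c & 0x3F)  # 6 payload bits per byte
    return value


v = raw_value(bytes([0xF4, 0x90, 0x90, 0x90]))
label = f"V+{v:06X}"
category = v >> 16   # bits 20:16
```

Here label comes out as V+110410 and category as 17, matching the
first row of section 1.1.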

Richard.