Counting Codepoints

Richard Wordingham richard.wordingham at
Tue Oct 13 01:36:30 CDT 2015

On Tue, 13 Oct 2015 00:49:29 +0200
Philippe Verdy <verdy_p at> wrote:

> 2015-10-12 21:38 GMT+02:00 Richard Wordingham <
> richard.wordingham at>:
> > Graceful fallback is exactly where the issue arises.  Throwing an
> > exception is not a useful answer to the question of how many code
> > points a 'Unicode string' (not a 'UTF-16 string') contains.

> If you get an invalid UTF-16 string, and caught an exception, this is
> a sign that it is not UTF-16, and very frequently something else. The
> application may want to retry with another encoding, possibly using
> heuristic guessers, but the heuristic will only give a *probable
> answer*.

On Mon, 12 Oct 2015 23:35:32 +0000
David Starner <prosfilaes at> wrote:

> Thus a Unicode string simply can't be in UTF-16 format
> internally with unpaired surrogates; a Unicode string in a programmer
> opaque format must do something with broken data on input.

You're assuming that the source of the non-conformance is external to
the program.  In the case that prompted me to ask about lone
surrogates, they were actually produced by a faulty character deletion
function within the program itself.  Despite this fault, the program
remains usable - it's little worse than a word processor that insists
on autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
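As a sketch of how such a fault can arise (an assumed mechanism, not
the actual code in question): a deletion routine that removes "one
character" by dropping a single UTF-16 code unit will split a surrogate
pair, leaving a lone surrogate behind.

```python
# Illustrative sketch only: U+1D400 (MATHEMATICAL BOLD CAPITAL A)
# is stored as the surrogate pair D835 DC00 in UTF-16.
units = [0xD835, 0xDC00]

# A naive "delete the last character" that removes one code unit
# instead of one scalar value:
del units[-1]

# The string now contains an unpaired high surrogate.
print([hex(u) for u in units])  # ['0xd835']
```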

I presume you are expecting input of fractional characters to be
buffered until there is a whole character to add to a string.  For
example, an MSKLC keyboard will deliver a supplementary character in
two WM_CHAR messages, one for the high surrogate and one for the low
surrogate.
Returning to the original questions, it would seem that there is not a
unique answer to the question of how many codepoints a Unicode 16-bit
string contains.  Rather, the question must be the unwieldy one of how
many scalar values and lone surrogates it contains in total.
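That unwieldy count - scalar values plus lone surrogates, each counted
as one - might be computed like this over a sequence of 16-bit code
units (a sketch; `count_codepoints` is an invented name):

```python
def count_codepoints(units):
    """Count scalar values and lone surrogates in a sequence of
    16-bit code units: a well-formed surrogate pair counts as one,
    and an unpaired surrogate also counts as one."""
    count = 0
    i = 0
    n = len(units)
    while i < n:
        if (0xD800 <= units[i] <= 0xDBFF
                and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            i += 2  # well-formed pair: one scalar value
        else:
            i += 1  # BMP scalar value or lone surrogate
        count += 1
    return count
```

For example, [0xD835, 0xDC00] counts as one, while the ill-formed
[0xD800, 0x0041] counts as two: a lone surrogate plus a scalar value.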
