Counting Codepoints

Richard Wordingham richard.wordingham at ntlworld.com
Tue Oct 13 01:36:30 CDT 2015


On Tue, 13 Oct 2015 00:49:29 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-10-12 21:38 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> > Graceful fallback is exactly where the issue arises.  Throwing an
> > exception is not a useful answer to the question of how many code
> > points a 'Unicode string' (not a 'UTF-16 string') contains.

> If you get an invalid UTF-16 string and catch an exception, this is
> a sign that it is not UTF-16, and very frequently something else. The
> application may want to retry with another encoding, possibly using
> heuristic guessers, but the heuristic will only give a *probable
> answer*.

On Mon, 12 Oct 2015 23:35:32 +0000
David Starner <prosfilaes at gmail.com> wrote:

> Thus a Unicode string simply can't be in UTF-16 format
> internally with unpaired surrogates; a Unicode string in a
> programmer-opaque format must do something with broken data on input.

You're assuming that the source of the non-conformance is external to
the program.  In the case that has caused me to ask about lone
surrogates, they were actually caused by a faulty character deletion
function within the program itself.  Despite this fault, the program
remains usable - it's little worse than a word processor that insists on
autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
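
To illustrate how such a fault can arise, here is a minimal sketch in
Python. It is not the actual function from the program in question; the
names delete_unit_before and delete_code_point_before are hypothetical,
and the string is modelled as a plain list of 16-bit code units.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def delete_unit_before(units, pos):
    """Naive deletion: always removes exactly one code unit before pos.

    Hypothetical example of the kind of fault described above.
    """
    if pos <= 0:
        return list(units), 0
    return units[:pos - 1] + units[pos:], pos - 1


def delete_code_point_before(units, pos):
    """Removes a whole surrogate pair when one ends at pos, else one unit."""
    if pos <= 0:
        return list(units), 0
    if pos >= 2 and is_low(units[pos - 1]) and is_high(units[pos - 2]):
        return units[:pos - 2] + units[pos:], pos - 2
    return units[:pos - 1] + units[pos:], pos - 1


# 'A' followed by U+1D400, stored as the pair D835 DC00
text = [0x0041, 0xD835, 0xDC00]
naive, _ = delete_unit_before(text, 3)         # leaves a lone D835 behind
careful, _ = delete_code_point_before(text, 3)  # removes the whole pair
```

With the naive version the string ends in an unpaired high surrogate,
which is the sort of internally generated non-conformance I mean.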

I presume you are expecting input of fractional characters to be
buffered until there is a whole character to add to a string.  For
example, an MSKLC keyboard will deliver a supplementary character in
two WM_CHAR messages, one for the high surrogate and one for the low
surrogate.
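
The buffering I have in mind might look something like the following
sketch, again in Python with code units as integers. The class name
SurrogateBuffer and its feed method are assumptions made for
illustration; they do not correspond to any Windows API. A high
surrogate is held back until the next unit shows whether it completes
a pair.

```python
class SurrogateBuffer:
    """Collects 16-bit code units and releases whole code points."""

    def __init__(self):
        self.pending_high = None  # high surrogate awaiting its partner
        self.output = []          # code points released so far

    def feed(self, unit):
        if 0xD800 <= unit <= 0xDBFF:
            # A new high surrogate; any earlier one was unpaired.
            if self.pending_high is not None:
                self.output.append(self.pending_high)
            self.pending_high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            if self.pending_high is not None:
                # Combine the pair into one supplementary code point.
                cp = (0x10000
                      + ((self.pending_high - 0xD800) << 10)
                      + (unit - 0xDC00))
                self.output.append(cp)
                self.pending_high = None
            else:
                # Low surrogate with nothing to pair with.
                self.output.append(unit)
        else:
            # Ordinary BMP unit; release any unpaired high surrogate first.
            if self.pending_high is not None:
                self.output.append(self.pending_high)
                self.pending_high = None
            self.output.append(unit)


buf = SurrogateBuffer()
buf.feed(0xD835)  # first message: nothing is released yet
buf.feed(0xDC00)  # second message: U+1D400 is released
```

Note that even here a decision is needed on what to do with a high
surrogate that is never followed by a low one; this sketch passes it
through as a lone surrogate rather than throwing.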

Returning to the original questions, it would seem that there is not a
unique answer to the question of how many codepoints a Unicode 16-bit
string contains.  Rather the question must be the unwieldy one of how
many scalar values and lone surrogates it contains in total.
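
As a rough sketch of that unwieldy count, assuming the string is given
as a sequence of 16-bit code units (the function name
count_code_points is mine, not from any library):

```python
def count_code_points(units):
    """Returns (scalar_values, lone_surrogates) for a Unicode 16-bit string."""
    scalars = 0
    lone = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Well-formed surrogate pair: one supplementary scalar value.
            scalars += 1
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired high or low surrogate.
            lone += 1
            i += 1
        else:
            # BMP scalar value.
            scalars += 1
            i += 1
    return scalars, lone


# 'A', U+1D400 as a pair, a lone high surrogate, 'B'
sample = [0x0041, 0xD835, 0xDC00, 0xD800, 0x0042]
scalars, lone = count_code_points(sample)
total = scalars + lone
```

The sum of the two numbers is the count of code points, and no
exception is needed to arrive at it.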

Richard.

