Counting Codepoints

David Starner prosfilaes at gmail.com
Tue Oct 13 10:23:36 CDT 2015


On Mon, Oct 12, 2015 at 11:42 PM Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Mon, 12 Oct 2015 23:35:32 +0000
> David Starner <prosfilaes at gmail.com> wrote:
>
> > Thus a Unicode string simply can't be in UTF-16 format
> > internally with unpaired surrogates; a Unicode string in a programmer
> > opaque format must do something with broken data on input.
>
> You're assuming that the source of the non-conformance is external to
> the program.  In the case that has caused me to ask about lone
> surrogates, they were actually caused by a faulty character deletion
> function within the program itself.  Despite this fault, the program
> remains usable - it's little worse than a word processor that insists on
> autocorrupting 'GHz' and 'MHz' to 'Ghz' and 'Mhz'.
>
> I presume you are expecting input of fractional characters to be
> buffered until there is a whole character to add to a string.  For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.
>
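
As a rough illustration of the buffering described in the quoted paragraph, here is a minimal Python sketch. The class name SurrogateBuffer and its methods are hypothetical, the 16-bit units are plain integers standing in for what WM_CHAR would deliver, and replacing an unpaired surrogate with U+FFFD is only one assumed policy for broken input.

```python
class SurrogateBuffer:
    """Hypothetical helper: accepts 16-bit units, stores whole code points."""

    def __init__(self):
        self._chars = []       # complete characters accepted so far
        self._pending = None   # high surrogate waiting for its partner

    def _flush_lone(self):
        # Assumed policy: an unpaired high surrogate becomes U+FFFD.
        if self._pending is not None:
            self._chars.append("\uFFFD")
            self._pending = None

    def feed(self, unit):
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: hold it back until the low surrogate arrives.
            self._flush_lone()
            self._pending = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            if self._pending is None:
                # Low surrogate with nothing to pair with.
                self._chars.append("\uFFFD")
            else:
                # Combine the pair into one supplementary code point.
                cp = 0x10000 + ((self._pending - 0xD800) << 10) + (unit - 0xDC00)
                self._chars.append(chr(cp))
                self._pending = None
        else:
            # Ordinary BMP unit: any pending high surrogate was unpaired.
            self._flush_lone()
            self._chars.append(chr(unit))

    def value(self):
        """Return the string built so far, excluding any pending surrogate."""
        return "".join(self._chars)


buf = SurrogateBuffer()
buf.feed(0xD801)             # first message: high surrogate, nothing visible yet
partial = buf.value()
buf.feed(0xDC00)             # second message: low surrogate completes U+10400
complete = buf.value()
```

With this design the string never contains half a character: partial is empty, and complete holds the single code point U+10400.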

Code working on a UTF-16 string can delete a single surrogate or add a
fractional character. Code working on a Unicode string (as opposed to a
"UTF-16 string") cannot: such a string might be stored internally in, say,
a Python-like format that is Latin-1, UCS-2, or UTF-32, with conversions
made as needed and the differences hidden from the user. If you let the
code delete or add a single surrogate, or if you interpret surrogates at
all, it's a UTF-16 string; as is often the case in computing, that gives
the programmer more power and control at the cost of being harder to use
and easier to break.
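
To make the contrast concrete, here is a small Python sketch. It assumes Python 3.3 or later, where str is indexed by code point; the helper names utf16_units and is_well_formed are made up for this example. Deleting one element from the code-point view removes the whole supplementary character, while deleting one element from the code-unit view leaves a lone surrogate.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_well_formed(units):
    """True if the sequence of 16-bit units decodes as valid UTF-16."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# "a", U+10400 (a supplementary character), "b"
s = "a\U00010400b"

# Code-point view: three elements; deleting index 1 removes all of U+10400.
by_code_point = s[:1] + s[2:]

# Code-unit view: four elements, because U+10400 is a surrogate pair.
units = utf16_units(s)

# Deleting only the low surrogate leaves an unpaired high surrogate behind.
broken = units[:2] + units[3:]

print(len(s), len(units))                             # element counts per view
print(is_well_formed(units), is_well_formed(broken))  # validity before/after
```

The second deletion is the kind of operation that is only expressible once the code is allowed to see individual surrogates.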