Counting Codepoints

Richard Wordingham richard.wordingham at ntlworld.com
Tue Oct 13 13:53:29 CDT 2015


On Tue, 13 Oct 2015 15:23:36 +0000
David Starner <prosfilaes at gmail.com> wrote:

> A UTF-16 string could delete one surrogate, or add a fractional
> character. A Unicode string (not a "UTF-16 string"), which could be
> stored internally in, say, a Python-like format which is Latin-1,
> UCS-2, or UTF-32, conversions made as needed and differences hidden
> from the user, can't.

Confusingly, the Unicode definitions are the other way round.  A
UTF-16 string is a string of UTF-16 code units in which all surrogate
code units are paired surrogates.  Any string of UTF-16 code units,
whether or not its surrogates are paired, is a Unicode 16-bit string.
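
As a rough illustration of the distinction, here is a minimal Python
sketch.  It assumes the string is given as a sequence of integers in
the range 0..0xFFFF; the function name is_well_formed_utf16 is my own
and not from any library.  Any such sequence is a Unicode 16-bit
string; only those for which the check returns True would also count
as UTF-16 strings in the sense above.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit in `units` is paired.

    `units` is assumed to be a sequence of 16-bit code units (ints).
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low one.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is unpaired.
            return False
        i += 1
    return True
```

For example, [0x0041, 0xD83D, 0xDE00] passes the check, while
[0x0041, 0xD83D] is still a Unicode 16-bit string but fails it.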

Richard.

