"A Programmer's Introduction to Unicode"
alastair at alastairs-place.net
Tue Mar 14 03:51:18 CDT 2017
On 14 Mar 2017, at 02:03, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> On Mon, 13 Mar 2017 19:18:00 +0000
> Alastair Houghton <alastair at alastairs-place.net> wrote:
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> The problem is that UTF-16 based code can very easily overlook the
> handling of surrogate pairs, and one can very easily get confused over what
> string lengths mean.
Yet the same problem exists for UCS-4: code based on it can very easily overlook the handling of combining characters. As for string lengths, lengths in code points are no more meaningful than lengths in UTF-16 code units. They don’t tell you anything about the number of user-visible characters, about the width the string will take up when rendered on the display (even in a fixed-width font), or about the number of glyphs a given string might be transformed into by glyph mapping. The *only* thing the length of a Unicode string will tell you is the number of code units.
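
To make that concrete, here is a rough sketch in Python using only the standard library. The helper names (code_points, utf16_units) are made up for this illustration. It assumes Python 3, where len() on a str counts code points. Neither count lines up with what a user would call one character.

```python
import unicodedata


def code_points(s: str) -> int:
    """Length in code points (what a UCS-4 / UTF-32 string would report)."""
    return len(s)


def utf16_units(s: str) -> int:
    """Length in UTF-16 code units (two bytes per unit, no BOM)."""
    return len(s.encode("utf-16-le")) // 2


# One user-visible character, precomposed: U+00E9
e_nfc = "\u00e9"
# The same user-visible character, decomposed: U+0065 followed by U+0301
e_nfd = unicodedata.normalize("NFD", e_nfc)
# One user-visible character outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF
g_clef = "\U0001D11E"

for label, s in [("e-acute NFC", e_nfc), ("e-acute NFD", e_nfd), ("G clef", g_clef)]:
    print(label, "code points:", code_points(s), "UTF-16 units:", utf16_units(s))

# e-acute NFC: 1 code point, 1 UTF-16 code unit
# e-acute NFD: 2 code points, 2 UTF-16 code units
# G clef:      1 code point, 2 UTF-16 code units
```

The decomposed e-acute trips up the code point count in exactly the way the surrogate pair trips up the UTF-16 count. Getting at user-visible characters needs grapheme cluster segmentation (UAX #29), which neither length gives you.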