"A Programmer's Introduction to Unicode"

Tue Mar 14 15:28:33 CDT 2017

On Tue, 14 Mar 2017 08:51:18 +0000
Alastair Houghton <alastair at alastairs-place.net> wrote:

> On 14 Mar 2017, at 02:03, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:
> > 
> > On Mon, 13 Mar 2017 19:18:00 +0000
> > Alastair Houghton <alastair at alastairs-place.net> wrote:

> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused
> > over what string lengths mean.
> 
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.

That's a different issue.  I presume you mean the issues of canonical
equivalence and detecting text boundaries.  Again, there is the problem
of remembering to consider the whole surrogate pair when using
UTF-16.  (I suppose this could be largely handled by avoiding the
concept of arrays.)  Now, the supplementary characters where these
issues arise are very infrequently used.  An error in UTF-16 code might
easily not come to attention, whereas a problem with UCS-4 (or UTF-8)
comes to light as soon as one handles Thai or IPA.
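
To make the contrast concrete, here is a minimal Python sketch (an
illustration only, not code from this thread).  The helper names
utf16_units and decode_units are hypothetical.  It assumes Python 3,
where a str is a sequence of code points, so the UTF-16 code units are
obtained by encoding.  The first part shows a supplementary character
occupying two code units that must be handled as a pair; the second
shows that even whole code points leave combining sequences and
canonical equivalence to be dealt with.

```python
import unicodedata

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def decode_units(units):
    """Turn UTF-16 code units into code points, joining surrogate pairs."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP code unit (or an unpaired surrogate)
            i += 1
    return out

# A supplementary character: U+10400 DESERET CAPITAL LETTER LONG I.
units = utf16_units("\U00010400")
code_points = decode_units(units)

# Thai: one consonant followed by two combining marks (three code points).
thai = "\u0e17\u0e35\u0e48"
marks = [c for c in thai if unicodedata.category(c) == "Mn"]

# Canonical equivalence: 'e' + COMBINING ACUTE ACCENT versus U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
```

Code that indexes units directly and forgets the pairing in
decode_units will only misbehave on supplementary characters, whereas
code that forgets about the combining marks misbehaves on any Thai
text.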

> As for string lengths, string
> lengths in code points are no more meaningful than string lengths in
> UTF-16 code units.  They don’t tell you anything about the number of
> user-visible characters; or anything about the width the string will
> take up if rendered on the display (even in a fixed-width font); or
> anything about the number of glyphs that a given string might be
> transformed into by glyph mapping.  The *only* thing a string length
> of a Unicode string will tell you is the number of code units.

A string length in codepoints does have the advantage of being
independent of encoding.  I'm actually using an index for UTF-16
text (I don't know whether it's denominated in codepoints or code
units) to index into the UTF-8 source code.  However, the number of code
units is the more commonly used quantity, as it tells one how much
memory is required for simple array storage.
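
As a rough illustration of the three quantities being discussed, the
following Python sketch (the function name lengths is hypothetical, and
Python 3 is assumed) reports the length of one string in code points,
in UTF-16 code units and in UTF-8 code units.  Only the first figure
stays the same whichever encoding form is used for storage; the other
two are what determine the array size.

```python
def lengths(s):
    """Return the length of s in code points and in code units."""
    return {
        "code_points": len(s),                           # encoding-independent
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf8_units": len(s.encode("utf-8")),            # 1 byte per unit
    }

# 'a' followed by a supplementary character (U+10400).
mixed = lengths("a\U00010400")

# 'e' followed by COMBINING ACUTE ACCENT: one user-visible character.
accented = lengths("e\u0301")
```

None of the three numbers equals the count of user-visible characters
in the second example, which is the point made in the quoted text.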

Richard.