"A Programmer's Introduction to Unicode"
richard.wordingham at ntlworld.com
Tue Mar 14 15:28:33 CDT 2017
On Tue, 14 Mar 2017 08:51:18 +0000
Alastair Houghton <alastair at alastairs-place.net> wrote:
> On 14 Mar 2017, at 02:03, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:
> > On Mon, 13 Mar 2017 19:18:00 +0000
> > Alastair Houghton <alastair at alastairs-place.net> wrote:
> > The problem is that UTF-16 based code can very easily overlook the
> > handling of surrogate pairs, and one can very easily get confused over
> > what string lengths mean.
> Yet the same problem exists for UCS-4; it could very easily overlook
> the handling of combining characters.
That's a different issue. I presume you mean the issues of canonical
equivalence and detecting text boundaries. Again, there is the problem
of remembering to consider the whole surrogate pair when using
UTF-16. (I suppose this could be largely handled by avoiding the
concept of arrays.) Now, the supplementary characters where these
issues arise are very infrequently used. An error in UTF-16 code might
easily not come to attention, whereas a problem with UCS-4 (or UTF-8)
comes to light as soon as one handles Thai or IPA.
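
A minimal Python sketch of the two kinds of oversight contrasted above: splitting a surrogate pair when working in UTF-16 code units, and ignoring combining marks or canonical equivalence when working in code points. The helper names (utf16_units, is_high_surrogate) and the sample strings are illustrative assumptions, not part of any library or of the original discussion.

```python
import unicodedata

def utf16_units(s):
    """Return the UTF-16 code units of a well-formed string as ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def is_high_surrogate(unit):
    """True if the code unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF

# Supplementary character: U+1D11E MUSICAL SYMBOL G CLEF needs two code units.
clef_text = "a\U0001D11Eb"
clef_units = utf16_units(clef_text)
# Naive truncation to two code units keeps only half of the pair.
truncated = clef_units[:2]
splits_pair = is_high_surrogate(truncated[-1])

# BMP combining mark: Thai NO NU + MAI THO + SARA AM, three code points.
thai = "\u0E19\u0E49\u0E33"
# Taking only thai[0] would leave the tone mark thai[1] orphaned.
mark_is_combining = unicodedata.combining(thai[1]) != 0

# Canonical equivalence: precomposed e-acute versus e + COMBINING ACUTE ACCENT.
composed = "\u00E9"
decomposed = "e\u0301"
equal_raw = composed == decomposed
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
```

The surrogate case only shows up with rarely used supplementary characters, while the combining-mark and equivalence cases appear with ordinary Thai or accented Latin text, which matches the point about which bugs get noticed first.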
> As for string lengths, string
> lengths in code points are no more meaningful than string lengths in
> UTF-16 code units. They don’t tell you anything about the number of
> user-visible characters; or anything about the width the string will
> take up if rendered on the display (even in a fixed-width font); or
> anything about the number of glyphs that a given string might be
> transformed into by glyph mapping. The *only* thing a string length
> of a Unicode string will tell you is the number of code units.
A string length in codepoints does have the advantage of being
independent of encoding. I'm actually using an index for UTF-16
text (I don't know whether it's denominated in codepoints or code
units) to index into the UTF-8 source code. However, the number of code
units is the more commonly used quantity, as it tells one how much
memory is required for simple array storage.
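
As a rough illustration of the different lengths being discussed, and of mapping a UTF-16 index onto UTF-8 text, here is a short Python sketch. It assumes the index is denominated in UTF-16 code units and that the text is well-formed; the function names (utf16_len, utf8_len, utf16_index_to_utf8_offset) are hypothetical and chosen only for this example.

```python
def utf16_len(s):
    """Length of s in UTF-16 code units."""
    return len(s.encode("utf-16-le")) // 2

def utf8_len(s):
    """Length of s in UTF-8 code units (bytes)."""
    return len(s.encode("utf-8"))

def utf16_index_to_utf8_offset(s, idx16):
    """Map a UTF-16 code-unit index into s to a UTF-8 byte offset.

    Assumes s is well-formed (no lone surrogates). An index that falls
    inside a surrogate pair is rounded up to the next code point boundary.
    """
    units = 0
    offset = 0
    for ch in s:
        if units >= idx16:
            break
        # Supplementary characters take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        offset += len(ch.encode("utf-8"))
    return offset

# 'a', U+1D11E MUSICAL SYMBOL G CLEF, Thai NO NU, 'b'
sample = "a\U0001D11E\u0E19b"
code_points = len(sample)          # Python strings are indexed by code point
units16 = utf16_len(sample)
units8 = utf8_len(sample)
# UTF-16 index 3 is the Thai letter; find where it starts in the UTF-8 bytes.
thai_offset = utf16_index_to_utf8_offset(sample, 3)
```

Only the code point count is the same whichever encoding the text is stored in; the two code unit counts are the ones that determine array storage.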