"A Programmer's Introduction to Unicode"
alastair at alastairs-place.net
Tue Mar 14 03:44:01 CDT 2017
On 13 Mar 2017, at 21:10, Khaled Hosny <khaledhosny at eglug.org> wrote:
> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote:
>> On 13 Mar 2017, at 17:55, J Decker <d3ck0r at gmail.com> wrote:
>>> I liked the Go implementation of character type - a rune type - which is a code point, and strings that return runes by index.
>> IMO, returning code points by index is a mistake. It over-emphasises
>> the importance of the code point, which helps to continue the notion
>> in some developers’ minds that code points are somehow “characters”.
>> It also leads to people unnecessarily using UCS-4 as an internal
>> representation, which seems to have very few advantages in practice
>> over UTF-16.
> But there are many text operations that require access to Unicode code
> points. Take for example text layout, as mapping characters to glyphs
> and back has to operate on code points. The idea that you never need to
> work with code points is too simplistic.
I didn’t say you never needed to work with code points. What I said is that there’s no advantage to UCS-4 as an encoding, and that there’s no advantage to being able to index a string by code point.

As it happens, I’ve written the kind of code you cite as an example, including glyph mapping and OpenType processing, and the fact is that it’s no harder to do it with a UTF-16 string than it is with a UCS-4 string. Yes, certainly, surrogate pairs need to be decoded to map to glyphs; but that’s a *trivial* matter, particularly as the code point to glyph mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope with being able to map multiple code units in the string to multiple glyphs in the result.
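
As a rough illustration of the surrogate-pair point, here is a minimal Python sketch of decoding code points while walking a sequence of UTF-16 code units. It is not the glyph-mapping code referred to above; the names utf16_units and code_points are made up for this example, and passing unpaired surrogates through unchanged is an assumption rather than a recommendation.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of a Python string as a list of ints.

    Hypothetical helper, used only to build input for the example.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_points(units):
    """Yield (index, code_point) pairs from a sequence of UTF-16 code units.

    index is the position of the first code unit of each code point, so a
    caller can still map results back to offsets in the UTF-16 string.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        # A high surrogate followed by a low surrogate encodes one
        # supplementary-plane code point.
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield i, 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit. Assumption: an unpaired surrogate is passed
            # through unchanged instead of being replaced or rejected.
            yield i, u
            i += 1
```

Calling list(code_points(utf16_units("a\U0001F600b"))) walks four code units and produces three code points, with the index of each one's first code unit alongside it. The glyph-mapping stage still has to handle N:M clusters on top of this, whichever encoding the string is stored in.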