"A Programmer's Introduction to Unicode"

Steffen Nurpmeso steffen at sdaoden.eu
Tue Mar 14 07:21:27 CDT 2017


Alastair Houghton <alastair at alastairs-place.net> wrote:
 |On 13 Mar 2017, at 21:10, Khaled Hosny <khaledhosny at eglug.org> wrote:
 |> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote:
 |>> On 13 Mar 2017, at 17:55, J Decker <d3ck0r at gmail.com> wrote:
 |>>> 
 |>>> I liked the Go implementation of character type - a rune type - \
 |>>> which is a codepoint.  and strings that return runes from by index.
 |>>> https://blog.golang.org/strings
 |>> 
 |>> IMO, returning code points by index is a mistake.  It over-emphasises
 |>> the importance of the code point, which helps to continue the notion
 |>> in some developers’ minds that code points are somehow “characters”.
 |>> It also leads to people unnecessarily using UCS-4 as an internal
 |>> representation, which seems to have very few advantages in practice
 |>> over UTF-16.
 |> 
 |> But there are many text operations that require access to Unicode code
 |> points. Take for example text layout, as mapping characters to glyphs
 |> and back has to operate on code points. The idea that you never need to
 |> work with code points is too simplistic.
 |
 |I didn’t say you never needed to work with code points.  What I said \
 |is that there’s no advantage to UCS-4 as an encoding, and that there’s \

Well, with a 32-bit unit you do have eleven spare bits per code
point, for example, since a code point needs at most 21 bits;
those can be used for flags.
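
To illustrate (an untested sketch; the macro and flag names are
made up for this mail, not taken from any real implementation):

  #include <stdint.h>

  /* A code point needs at most 21 bits (max U+10FFFF), so a 32-bit
   * unit leaves 11 bits free for per-codepoint flags.  The flags
   * below are hypothetical. */
  #define CP_MASK        0x001FFFFFu  /* low 21 bits: the code point */
  #define FLAG_SHIFT     21           /* flags use the upper 11 bits */
  #define FLAG_COMBINING (1u << 0)    /* hypothetical flag */
  #define FLAG_UPPERCASE (1u << 1)    /* hypothetical flag */

  static inline uint32_t cp_pack(uint32_t cp, uint32_t flags){
     return (cp & CP_MASK) | (flags << FLAG_SHIFT);
  }
  static inline uint32_t cp_codepoint(uint32_t u){ return u & CP_MASK; }
  static inline uint32_t cp_flags(uint32_t u){ return u >> FLAG_SHIFT; }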

 |no advantage to being able to index a string by code point.  As it \

With UTF-32 you can take the code point as it is and look it up
in the Unicode classification tables directly.
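
For example, a direct range lookup keyed by the raw code point
could look like this (a sketch only; the tiny table below is made
up, a real one would be generated from the UCD):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct cp_range { uint32_t first, last; };

  static const struct cp_range alpha_ranges[] = {
     { 0x0041, 0x005A },  /* A-Z */
     { 0x0061, 0x007A },  /* a-z */
     { 0x00C0, 0x00D6 },  /* part of Latin-1 */
     /* ... generated from UnicodeData.txt in a real implementation */
  };

  static bool cp_is_alpha(uint32_t cp){
     size_t i;
     for(i = 0; i < sizeof(alpha_ranges)/sizeof(alpha_ranges[0]); ++i)
        if(cp >= alpha_ranges[i].first && cp <= alpha_ranges[i].last)
           return true;
     return false;
  }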

 |happens, I’ve written the kind of code you cite as an example, including \
 |glyph mapping and OpenType processing, and the fact is that it’s no \
 |harder to do it with a UTF-16 string than it is with a UCS-4 string. \
 | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \
 |but that’s a *trivial* matter, particularly as the code point to glyph \
 |mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope \
 |with being able to map multiple code units in the string to multiple \
 |glyphs in the result.

If you have to iterate over a string to perform some high-level
processing, then UTF-8 is an almost equally fine choice, for the
very same reasons you bring up.  And if the usage-pattern
"hotness" pictures shown at the beginning of this thread are
correct, then the size overhead of UTF-8 that the UTF-16
proponents point out turns out to be insignificant in practice.
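
For illustration, a minimal sketch of such an iteration, decoding
UTF-8 to code points on the fly (untested; a production decoder
would also have to reject overlong forms and surrogate values):

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Decode one code point from s[0..len), return bytes consumed;
   * malformed input yields U+FFFD and consumes one byte. */
  static size_t utf8_decode(const unsigned char *s, size_t len,
        uint32_t *cp){
     if(len == 0) return 0;
     if(s[0] < 0x80){ *cp = s[0]; return 1; }
     if((s[0] & 0xE0) == 0xC0 && len >= 2 && (s[1] & 0xC0) == 0x80){
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
     }
     if((s[0] & 0xF0) == 0xE0 && len >= 3 &&
           (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80){
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
     }
     if((s[0] & 0xF8) == 0xF0 && len >= 4 &&
           (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 &&
           (s[3] & 0xC0) == 0x80){
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
     }
     *cp = 0xFFFD;  /* replacement character on malformed input */
     return 1;
  }

  int main(void){
     const char *text = "na\xC3\xAFve \xF0\x9F\x99\x82";  /* naive+umlaut, smiley */
     const unsigned char *p = (const unsigned char*)text;
     size_t left = strlen(text);
     while(left > 0){
        uint32_t cp;
        size_t n = utf8_decode(p, left, &cp);
        printf("U+%04X\n", (unsigned)cp);  /* hand cp to high-level code */
        p += n;
        left -= n;
     }
     return 0;
  }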

But I for one have given up on making a stand against UTF-16 or
BOMs.  In fact I have come to think that UTF-16 is a pretty nice
in-memory representation, and it is only a small step from it to
the actual code point that you need in order to decide what
something is and what has to be done with it.  I don't know
whether I would really use it for that purpose, though; I am
pretty sure that my core Unicode functions will (start to /)
continue to use UTF-32, because the code point to code point(s)
mapping is what the standard describes, and anything else can be
implemented on top of that.  E.g., since a code point needs only
21 bits, you can store three UTF-32 code points in a single
uint64_t, and I would shoot myself in the foot if I made that
accessible only via a UTF-16 or UTF-8 converter, imho; instead,
I (will) make it accessible directly as UTF-32, and that serves
all other formats equally well.  Of course, if it is clear that
you are UTF-16 all the way through, then you can save the
conversion, but (the) most (widespread) Uni(x|ces) are UTF-8
based and it looks as if that will stay so.  Yes, yes, you can
nonetheless use UTF-16, but it will most likely not save you
anything on the database side, due to storage alignment
requirements and the necessity to be able to access the data
somewhere.  If you have a single index-lookup array and
a dynamically sized database storage that uses two-byte
alignment, of course, then I can imagine UTF-16 coming out
ahead.  I never looked at how ICU does it, but I have been
impressed by the sheer data facts ^.^
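
To make the two points above concrete (untested sketches, the
helper names are made up): the "small step" from a UTF-16
surrogate pair to the code point, and packing three 21-bit code
points into one uint64_t (3 * 21 = 63 bits):

  #include <stdint.h>

  /* Assumes 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF. */
  static inline uint32_t utf16_combine(uint16_t hi, uint16_t lo){
     return 0x10000u +
        (((uint32_t)(hi - 0xD800u) << 10) | (uint32_t)(lo - 0xDC00u));
  }

  /* Three code points of at most 21 bits each in one uint64_t. */
  static inline uint64_t cp3_pack(uint32_t a, uint32_t b, uint32_t c){
     return (uint64_t)a | ((uint64_t)b << 21) | ((uint64_t)c << 42);
  }
  static inline uint32_t cp3_get(uint64_t packed, unsigned idx){
     return (uint32_t)((packed >> (21u * idx)) & 0x1FFFFFu);
  }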

--steffen


