Why Work at Encoding Level?

Sun Oct 18 19:45:14 CDT 2015

On Wed, 14 Oct 2015 00:28:26 +0100
Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:

> If UTF-32 feels wasteful there are various smart ways of providing
> direct indexing at a reasonable cost if you are in a language that
> has minimal support for datatype definition and abstraction.  

I can't find a good one that's been published.  The Elias-Fano encoding
for UTF-8 indexing works out at 3 to 5 bits per character even without
extending to achieve 'constant time' access, the limiting extremes
being English and Ugaritic. (Most SMP scripts use a lot of ASCII.)  For
genuine UTF-8 text I can happily get the memory requirement down to
1.031 bits per character.  I exploit the fact that one can easily
advance character by character through a UTF-8 string, but limit myself
to 5 advances. The extra 0.031 bits per character only arise for
strings longer than a thousand characters, and could be reduced to
0.002 with some extra processing. There's a lot of redundancy in the
positions.
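
To make the "limited advances" idea concrete, here is a minimal sketch in
Python. It is not the scheme described above: it stores a plain table of
byte offsets, which costs far more than 1.031 bits per character. It only
shows how a sparse sample of positions plus a few forward steps gives
indexed access into UTF-8. The class name SampledUtf8Index and the
sampling interval STEP = 6 (so at most 5 advances per lookup) are
assumptions made for illustration.

```python
class SampledUtf8Index:
    """Illustrative only: byte offset of every STEP-th code point."""

    STEP = 6  # assumed interval, giving at most STEP - 1 = 5 advances

    def __init__(self, data: bytes):
        self.data = data
        self.samples = []  # byte offsets of code points 0, STEP, 2*STEP, ...
        count = 0
        for offset, byte in enumerate(data):
            # Any byte that is not a continuation byte (10xxxxxx)
            # starts a new code point.
            if byte & 0xC0 != 0x80:
                if count % self.STEP == 0:
                    self.samples.append(offset)
                count += 1
        self.length = count  # number of code points

    def _next_start(self, offset: int) -> int:
        """Advance one code point: skip the lead byte and its continuations."""
        offset += 1
        while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
            offset += 1
        return offset

    def byte_offset(self, index: int) -> int:
        if not 0 <= index < self.length:
            raise IndexError(index)
        offset = self.samples[index // self.STEP]
        for _ in range(index % self.STEP):  # at most STEP - 1 advances
            offset = self._next_start(offset)
        return offset

    def char_at(self, index: int) -> str:
        start = self.byte_offset(index)
        return self.data[start:self._next_start(start)].decode("utf-8")
```

Each lookup reads one table entry and then steps forward over at most
five characters, so the cost per access is bounded regardless of the
string length. Compressing the table itself is where the real savings
would have to come from.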

> Note that the Swift programming language seems to have gone even
> further than I would have: their notion of character is a grapheme
> cluster tested for equality using canonical equivalence and that's
> what they index in their strings, see [1]. Don't know how well that
> works in practice as I personally never used it; but it feels like
> the ultimate Unicode string model you want to provide to the
> zero-knowledge Unicode programmer (at least for alphabetic scripts).

It doesn't quite work.  For Thai at least, deleting backwards should
delete just a combining mark rather than the whole grapheme cluster.  I
couldn't find any provision for this in Swift.  There is also the
question (irrelevant for Thai) of whether this deletion should be done
in NFC or NFD.  Having backward deletion remove only a combining mark
also makes sense for the International Phonetic Alphabet, as well as
for the Thai script used alphabetically (as is often done for Pali) and
for the Lao script; the modern Lao writing system is formally an
alphabet.
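
As a rough illustration of the behaviour described above, the following
Python sketch deletes backwards by one code point after normalizing, so
the choice of NFC or NFD becomes visible. The function name
delete_backward is hypothetical, and a real editor would need more
policy than "drop the last code point"; this only shows the difference
from deleting a whole grapheme cluster.

```python
import unicodedata


def delete_backward(text: str, form: str = "NFC") -> str:
    """Normalize to `form` ("NFC" or "NFD"), then drop the last code point."""
    normalized = unicodedata.normalize(form, text)
    return normalized[:-1]


# Thai KO KAI + SARA II + MAI EK: one grapheme cluster, three code points.
thai = "\u0e01\u0e35\u0e48"
# Only the tone mark MAI EK is removed, in either normalization form.
assert delete_backward(thai, "NFC") == "\u0e01\u0e35"
assert delete_backward(thai, "NFD") == "\u0e01\u0e35"

# Precomposed e-acute: here the normalization form changes the result.
assert delete_backward("\u00e9", "NFC") == ""   # the whole letter goes
assert delete_backward("\u00e9", "NFD") == "e"  # only the accent goes
```

The Thai case is unaffected by the normalization form because these Thai
characters have no canonical compositions, which is why the NFC/NFD
question does not arise there.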

Richard.