Unicode String Models

Sun Sep 9 02:59:29 CDT 2018

On Sat, 8 Sep 2018 18:36:00 +0200
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> 
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Theoretically at least, the cost of indexing a big string by codepoint
is negligible.  For example, the cost of accessing the middle character
is O(1)*, not O(n), where n is the length of the string.  The trick is to
use a proportionately small amount of memory to store and maintain a
partial conversion table from character index to byte index.  For
example, Emacs claims to offer O(1) access to a UTF-8 buffer by
character number, and I can't significantly fault the claim.

*There may be some creep, but it doesn't matter for strings that can be
stored within a galaxy.
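
As a rough illustration of the partial conversion table, here is a
minimal Python sketch.  It is not how Emacs does it; the class name
IndexedUtf8, the stride parameter and the method names are all made up
for the example, and the input is assumed to be valid UTF-8.  One byte
offset is stored for every stride-th codepoint, so a lookup jumps to
the nearest stored offset and then walks forward over fewer than
stride codepoints, whatever the length of the string.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a sparse table: codepoint index -> byte offset.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    """

    def __init__(self, data: bytes, stride: int = 64):
        self.data = data
        self.stride = stride
        # Byte offsets of codepoints 0, stride, 2*stride, ...
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(data):
            if byte & 0xC0 != 0x80:  # not a continuation byte
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count  # number of codepoints

    def byte_offset(self, index: int) -> int:
        """Byte offset of the codepoint at `index`."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        offset = self.checkpoints[index // self.stride]
        # Walk forward over at most stride - 1 codepoints.
        for _ in range(index % self.stride):
            offset += 1
            while self.data[offset] & 0xC0 == 0x80:
                offset += 1
        return offset

    def char_at(self, index: int) -> str:
        """The codepoint at `index`, decoded to a one-character str."""
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")


text = "na\u00efve caf\u00e9 \u2615 \U0001d11e " * 100
buf = IndexedUtf8(text.encode("utf-8"), stride=8)
middle = buf.char_at(buf.length // 2)
```

The table costs one integer per stride codepoints, so the stride
trades memory against the constant in the lookup.  Keeping the table
up to date under edits is the harder part and is not attempted here.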

Of course, the coefficients implied by big-oh notation also matter.
For example, it can be very easy to forget that, for short inputs, a
bubble sort is often the quickest sorting algorithm.

You keep muttering that a sequence of 8-bit code units can contain
invalid sequences, but often forget that that is also true of sequences
of 16-bit code units.  Do emoji now ensure that confusion between
codepoints and code units rapidly comes to light?
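
To make the 16-bit case concrete, here is a small Python sketch.  The
function names utf16_units and is_well_formed_utf16 are invented for
the example.  It shows an emoji occupying two code units, and a
sequence cut between them becoming ill-formed, just as a truncated
UTF-8 sequence would.

```python
def utf16_units(text: str) -> list:
    """The UTF-16 code units of `text`, as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_well_formed_utf16(units) -> bool:
    """True if no surrogate code unit in `units` is unpaired."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate with no high surrogate before it
        i += 1
    return True


text = "a\u2615\U0001f600"       # 3 codepoints
units = utf16_units(text)        # 4 code units
truncated = units[:3]            # cut in the middle of the emoji
```

Here is_well_formed_utf16(units) holds but
is_well_formed_utf16(truncated) does not, and len(units) differs from
len(text), which is the codepoint/code unit confusion in miniature.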

You seem to keep forgetting that grapheme clusters are not how some
people work.  Does the English word 'café' contain the letter
'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
one might want to look for the letter 'ă' in Vietnamese and find it
whatever the associated tone mark is.
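
One way to get that behaviour is sketched below in Python, using only
the standard unicodedata module.  The function names are invented for
the example, and the matching rule is my own loose assumption: a
letter is found if some base character in the word matches its base
and the letter's combining marks occur, in order, among that base's
marks.  The groups here are just a base plus following characters of
non-zero combining class, not proper grapheme clusters.

```python
import unicodedata


def base_and_marks(text: str) -> list:
    """Split NFD(text) into [base, [marks...]] groups."""
    groups = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and groups:
            groups[-1][1].append(ch)   # attach mark to the previous base
        else:
            groups.append([ch, []])    # start a new group
    return groups


def is_subsequence(needle, haystack) -> bool:
    """True if the items of `needle` occur in order within `haystack`."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def contains_letter(word: str, letter: str) -> bool:
    """Loose search for one letter (base plus marks) inside `word`."""
    target = base_and_marks(letter)
    if len(target) != 1:
        raise ValueError("letter must be a single base with its marks")
    base, marks = target[0]
    return any(b == base and is_subsequence(marks, m)
               for b, m in base_and_marks(word))


in_cafe = contains_letter("caf\u00e9", "e")          # 'e' in 'café'
in_nang = contains_letter("n\u1eb7ng", "\u0103")     # 'ă' in 'nặng'
```

Because both arguments are decomposed first, the answer is the same
whether the word arrives in NFC or NFD.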

You didn't discuss substrings.  I'm interested in how subsequences of
strings are defined, as the concept of 'substring' isn't really Unicode
compliant.  Again, expressing 'ă' as a subsequence of the Vietnamese
word 'nặng' ought to be possible, whether one is using NFD (easier) or
NFC.  (And there are alternative normalisations that are compatible
with canonical equivalence.)  I'm most interested in subsequences X of a
word W where W is the same as AXB for some strings A and B.
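
The AXB condition can be tested by brute force, as in the Python
sketch below.  The name find_splits is invented for the example, the
search is exponential in the worst case, and it only looks for A and B
built from the leftover codepoints of NFD(W), so it is a way of
playing with the definition rather than a proposed algorithm.

```python
from itertools import combinations
import unicodedata


def nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


def find_splits(word: str, x: str) -> list:
    """Pairs (A, B) with A + X + B canonically equivalent to `word`.

    X is matched as a possibly non-contiguous subsequence of NFD(word);
    leftover codepoints before the match go to A, the rest go to B.
    """
    w, x = nfd(word), nfd(x)
    if not x:
        raise ValueError("x must not be empty")
    found = []
    for positions in combinations(range(len(w)), len(x)):
        if any(w[p] != ch for p, ch in zip(positions, x)):
            continue
        chosen = set(positions)
        first = positions[0]
        a = w[:first]
        b = "".join(w[i] for i in range(first + 1, len(w))
                    if i not in chosen)
        # Canonical equivalence check: compare in NFD.
        if nfd(a + x + b) == w and (a, b) not in found:
            found.append((a, b))
    return found


# 'ă' in 'nặng': A = 'n', B = U+0323 followed by 'ng'.
splits = find_splits("n\u1eb7ng", "\u0103")
```

For 'nặng' the dot below (class 220) sorts before the breve (class
230) in NFD, so 'ă' is not a contiguous substring there, yet the split
is still found because A + X + B reorders to the same NFD string.  By
contrast, 'á' is not found in 'nắng': breve and acute share a
combining class, so their order is significant.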

Richard.