Why Work at Encoding Level?
daniel.buenzli at erratique.ch
Tue Oct 13 18:28:26 CDT 2015
Le mardi, 13 octobre 2015 à 23:37, Richard Wordingham a écrit :
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.
If UTF-32 feels wasteful there are various smart ways of providing direct indexing at a reasonable cost if you are in a language that has minimal support for datatype definition and abstraction.
Also I personally find indexing to be rarely useful in string processing, so it may not be the operation you want to optimize for. Having iterators-like functions as you suggest and a datatype to represent substrings seems often a better fit than doing indexing arithmetic.
Note that the Swift programming language seems to have gone even further than I would have: their notion of character is a grapheme cluster tested for equality using canonical equivalence and that's what they index in their strings, see . Don't know how well that works in practice as I personally never used it; but it feels like the ultimate Unicode string model you want to provide to the zero-knowledge Unicode programmer (at least for alphabetic scripts).
More information about the Unicode