Why Work at Encoding Level?

Tue Oct 13 18:28:26 CDT 2015

Le mardi, 13 octobre 2015 à 23:37, Richard Wordingham a écrit :
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.

If UTF-32 feels wasteful there are various smart ways of providing direct indexing at a reasonable cost if you are in a language that has minimal support for datatype definition and abstraction.  

Also I personally find indexing to be rarely useful in string processing, so it may not be the operation you want to optimize for. Having iterators-like functions as you suggest and a datatype to represent substrings seems often a better fit than doing indexing arithmetic.

Note that the Swift programming language seems to have gone even further than I would have: their notion of character is a grapheme cluster tested for equality using canonical equivalence and that's what they index in their strings, see [1]. Don't know how well that works in practice as I personally never used it; but it feels like the ultimate Unicode string model you want to provide to the zero-knowledge Unicode programmer (at least for alphabetic scripts).

Best,

Daniel

[1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html