Why Work at Encoding Level?
daniel.buenzli at erratique.ch
Wed Oct 21 14:42:30 CDT 2015
On Wednesday, October 21, 2015 at 19:43, Mark Davis ☕️ wrote:
> Moreover, a key problem is the indexes. When you are calling out to an API that takes a String and an index into that string, you could have a simple method to return a String (if that is your internal representation). But you will have to convert from your codepoint index to the API's code unit index. That either involves storing an interesting data structure in your StringX object, or doing a scan, which is relatively expensive.
I'm not sure I fully understand what you wanted to say here. So I'm just trying to respond to the last sentence.
You can have an abstract datatype that *represents* a scalar value index in a string: it knows the exact byte index (or underlying storage element) at which the scalar value starts in the string, but it hides the actual value from you. This allows you to access the scalar value directly, without scanning or storing an interesting data structure in StringX, while still preventing direct access to the underlying encoding.
The idea here is that direct random indexing is rarely needed; what happens most of the time is that you need to remember specific points in the string during a string traversal. For example, think about delineating the substrings matching a pattern. Whenever you hit one of these points, the traversal function knows the exact byte index and can yield a value of the abstract index datatype.
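A minimal sketch of such an abstract index in Rust, assuming UTF-8 as the underlying storage; the `SIndex` type and its methods are illustrative names, not an existing API:

```rust
/// Sketch of an abstract scalar value index over a UTF-8 string.
/// The byte offset is private, so clients cannot observe or forge it,
/// yet access through it is O(1) with no scanning.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct SIndex(usize);

impl SIndex {
    /// Index of the first scalar value (valid for non-empty strings).
    pub const FIRST: SIndex = SIndex(0);

    /// O(1): decode the scalar value starting at this index.
    pub fn get(self, s: &str) -> char {
        s[self.0..].chars().next().expect("index within bounds")
    }

    /// Advance to the next scalar value during a traversal, if any.
    pub fn next(self, s: &str) -> Option<SIndex> {
        let j = self.0 + self.get(s).len_utf8();
        if j >= s.len() { None } else { Some(SIndex(j)) }
    }

    /// Substring delimited by two remembered indices.
    pub fn sub(s: &str, i: SIndex, j: SIndex) -> &str {
        &s[i.0..j.0]
    }
}
```

Remembering a point during traversal is just holding on to an `SIndex` value; coming back to it later costs nothing, with no rescan and no auxiliary index structure stored in the string object.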
> Unicode evolved over time, and had pretty severe constraints when it originated.
Sure. What I'm trying to say here is that its presentation could maybe be modernized a bit, by putting greater emphasis on scalar values and less on their encoding. This could improve the messy conceptual model of Unicode I tend to find in the brains of my programmer peers.
> Asmus put it nicely (why the thread split I don't know).
> "When it comes to methods operating on buffers there's always the tension between viewing the buffer as text elements vs. as data elements. For some purposes, from error detection to data cleanup you need to be able to treat the buffer as data elements.
> If you desire to have a regex that you can use to validate a raw buffer, then that regex must do something sensible with partial code points."
I personally don't think this is a good or desirable way of operating. Sanitize inputs and handle encoding errors first, at the IO boundary of your program; then process the cleaned-up data, on which you know strong invariants hold.
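In Rust terms, purely as an illustration of this boundary discipline: decode once with lossy replacement at input, and from then on the `String`/`&str` types themselves carry the valid-UTF-8 invariant, so downstream code never sees partial code points:

```rust
/// Sketch of sanitizing at the IO boundary: malformed UTF-8 sequences
/// in the raw bytes are replaced by U+FFFD REPLACEMENT CHARACTER.
/// Everything after this call can rely on the data being valid UTF-8.
fn sanitize(raw: &[u8]) -> String {
    String::from_utf8_lossy(raw).into_owned()
}
```

After this single pass, no regex or traversal function in the rest of the program ever needs to "do something sensible with partial code points"; the invariant was established once, at the edge.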