Why Work at Encoding Level?
Mark Davis ☕️
mark at macchiato.com
Wed Oct 21 13:43:02 CDT 2015
On Wed, Oct 21, 2015 at 6:16 AM, Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:
> Le mercredi, 21 octobre 2015 à 04:37, Mark Davis ☕️ a écrit :
> > If you're not, the question is relevant.
> I'm not disputing the question, I'm disputing trying to give it a defined
> answer. Even if your string is UTF-16 based these problems can be solved by
> providing proper abstractions at the library level and asking clients to
> handle the problem *once* when you inject the UTF-16 strings in your
> abstraction which can then operate in a "clean" world where these questions
> do not arise.
Again, a nice thought. I am sympathetic to what you want, but for most
people it runs into the brick wall of reality.
Let's take Java for example. You could clearly write your own StringX
class, that was logically UTF-32 (
like the Uniform model in
But modern products use countless libraries to do their work, so you'll
end up converting every time you call one of those libraries or get back a
result. In the end, it might make your piece of code more reliable, but
there will be a certain cost. And you are still dependent on those other
Moreover, a key problem is the indexes. When you are calling out to an API
that takes a String and an index into that string, you could have a simple
method to return a String (if that is your internal representation). But
you will have to convert from your code point index to the API's code unit
index. That either involves storing an interesting data structure in your
StringX object, or doing a scan, which is relatively expensive.
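To make the cost concrete, here is a minimal sketch of such a class. It is
an illustration only: the name StringX comes from the text above, but the
int[] representation and the methods of(), length() and toCodeUnitIndex()
are assumptions made up for this example, not an existing library. It takes
the "doing a scan" route for the index conversion.

```java
/** Hypothetical code-point-indexed string (logically UTF-32); illustration only. */
final class StringX {
    // One int per code point. This representation is an assumption of the sketch.
    private final int[] codePoints;

    private StringX(int[] codePoints) {
        this.codePoints = codePoints;
    }

    /** Conversion in: paid for every String that comes back from a library. */
    static StringX of(String s) {
        return new StringX(s.codePoints().toArray());
    }

    /** Conversion out: paid for every call into a String-based library. */
    @Override
    public String toString() {
        return new String(codePoints, 0, codePoints.length);
    }

    /** Length in code points, not UTF-16 code units. */
    int length() {
        return codePoints.length;
    }

    /**
     * Maps a code point index to the matching UTF-16 code unit index.
     * This is the linear scan: O(n) in the index, on every call.
     */
    int toCodeUnitIndex(int codePointIndex) {
        if (codePointIndex < 0 || codePointIndex > codePoints.length) {
            throw new IndexOutOfBoundsException("index " + codePointIndex);
        }
        int units = 0;
        for (int i = 0; i < codePointIndex; i++) {
            // 1 code unit for BMP code points, 2 for supplementary ones.
            units += Character.charCount(codePoints[i]);
        }
        return units;
    }
}
```

For example, with s = StringX.of("a\uD83D\uDE00b") (U+1F600 between two
ASCII letters), s.length() is 3, while a call into a String-based API needs
both conversions: s.toString().indexOf("b", s.toCodeUnitIndex(2)) searches
from code unit 3, not 2. Avoiding the scan would mean keeping a side table
of offsets, which is the "interesting data structure" alternative.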
> Besides programming languages do evolve and one should at least make sure
> that new languages provide adequate abstractions for handling Unicode text.
> Looking at the recent batch of new languages I don't think this is
> happening. I'm sure language designers are keen on taking off-the-shelf
> designs for this rather than get into the details, but I would say that
> TUS by defining notions of Unicode strings at the encoding level is not
> doing a very good job at providing one.
Unicode evolved over time, and had pretty severe constraints when it
originated. I agree that for a new language it would be cleaner to have a
> FWIW when I got into the standard around 2008 by reading that thick
> hard-copy of TUS 5.0, it took me quite some time to actually understand and
> uncover the real structure behind Unicode, which is the scalar values.
Asmus put it nicely (why the thread split I don't know).
"When it comes to methods operating on buffers there's always the tension
between viewing the buffer as text elements vs. as data elements. For some
purposes, from error detection to data cleanup you need to be able to treat
the buffer as data elements. For many other operations, a focus on text
elements is enough.
If you desire to have a regex that you can use to validate a raw buffer,
then that regex must do something sensible with partial code points. If you
don't have multiple regex engines, then limiting your single one to valid
input prevents you from using it everywhere."
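As a small illustration of the "data elements" view Asmus describes, here
is a sketch of an error-detection pass over a raw UTF-16 buffer. It has to
look at individual code units, because an unpaired surrogate is exactly the
kind of partial code point a text-element view would hide. The class and
method names (Utf16Check, firstIllFormedIndex) are hypothetical.

```java
/** Hypothetical helper for validating a raw UTF-16 buffer; illustration only. */
final class Utf16Check {
    /**
     * Returns the index of the first unpaired surrogate code unit,
     * or -1 if the buffer is well-formed UTF-16.
     */
    static int firstIllFormedIndex(CharSequence buf) {
        int i = 0;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 < buf.length() && Character.isLowSurrogate(buf.charAt(i + 1))) {
                    i += 2;
                    continue;
                }
                return i;
            }
            if (Character.isLowSurrogate(c)) {
                // A low surrogate with no preceding high surrogate.
                return i;
            }
            i++;
        }
        return -1;
    }
}
```

A buffer that passes this check can be handed to code that only thinks in
code points; one that fails it still has to be processed by something.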