Why Work at Encoding Level?

Tue Oct 13 18:41:36 CDT 2015

Speed has little to do with in-memory buffer sizes (memory is cheap and
plentiful now), and parsing encodings in memory is extremely fast.
The actual limitation is I/O (network or disk storage), and at that
level you work with network datagrams/packets, disk buffers, or memory
pages for paging, all of which use buffers of static size (so the memory
allocation cost can be avoided, since the buffers are reusable).

Given that, you can easily create default buffers as small as about 4 KB
and convert their contents from any encoding to another with a static
auxiliary buffer that is also small (16 KB for the worst cases), and
handle at little cost a buffer boundary that falls in the middle of an
encoding sequence. Working with buffers considerably reduces the number of
I/O operations performed, and you can still compress the data chunk by
chunk (just make sure your auxiliary buffer has enough spare bytes at the
end for the worst case, to avoid performing two I/O operations or
compressing two chunks, one of them degenerate).
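
To make the chunked conversion concrete, here is a minimal sketch in
Python, assuming file-like objects for input and output. The function name
transcode and its parameters are made up for the illustration; the only
library calls are the standard codecs incremental decoder and encoder,
which keep the bytes of a sequence cut by a chunk boundary and complete it
when the next chunk arrives.

```python
import codecs
import io

def transcode(src, dst, src_enc="utf-8", dst_enc="utf-16-le", chunk_size=4096):
    """Copy src to dst in fixed-size chunks, converting the encoding on the way.

    src and dst are binary file-like objects (hypothetical parameters).
    """
    decoder = codecs.getincrementaldecoder(src_enc)(errors="strict")
    encoder = codecs.getincrementalencoder(dst_enc)(errors="strict")
    while True:
        chunk = src.read(chunk_size)
        final = not chunk  # an empty read means end of input
        # The decoder buffers a trailing incomplete sequence internally and
        # raises on the final call if the input ends in the middle of one.
        text = decoder.decode(chunk, final)
        dst.write(encoder.encode(text, final))
        if final:
            break

# Usage: a tiny chunk size forces sequences to be split across chunks.
sample = "a\u00e9\u20ac\U0001F600" * 1000
out = io.BytesIO()
transcode(io.BytesIO(sample.encode("utf-8")), out, chunk_size=7)
```

The only state carried from one chunk to the next is the few pending bytes
held by the decoder, so the working buffers can stay small and fixed.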

Even data compression is fast now and helps reduce I/O: the cost of
compression in memory is small compared to the cost of I/O, so much so
that the Windows kernel can now also use generic data compression on
memory pages to improve overall system performance when the global memory
page pool is full, or for disk virtualization purposes.

The UTF-8 encoding is extremely simple and very fast to implement, and in
most cases it saves a lot of space compared to storing UTF-32 (including
for large collections of text elements in memory).

So using iterators is the way to go: it is simple to program, easy to
optimize, and you can completely forget that UTF-8 is used in the
underlying store.
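
As an illustration of such an iterator, here is a minimal sketch in Python
over a bytes object. The name iter_code_points is made up for the example,
and it assumes the input was already sanitized at the I/O boundary: it
checks the byte structure but does not reject overlong forms or encoded
surrogates.

```python
def iter_code_points(buf):
    """Yield (byte_offset, code_point) pairs from a bytes object of UTF-8."""
    i = 0
    n = len(buf)
    while i < n:
        b0 = buf[i]
        # The lead byte gives the sequence length and the high payload bits.
        if b0 < 0x80:
            length, cp = 1, b0
        elif b0 >> 5 == 0b110:
            length, cp = 2, b0 & 0x1F
        elif b0 >> 4 == 0b1110:
            length, cp = 3, b0 & 0x0F
        elif b0 >> 3 == 0b11110:
            length, cp = 4, b0 & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each trail byte has the form 10xxxxxx and adds six payload bits.
        for b in buf[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError(f"invalid trail byte in sequence at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield i, cp
        i += length

# Usage: the caller sees scalar values and never handles the bytes itself.
pairs = list(iter_code_points("a\u00e9\u20ac\U0001F600".encode("utf-8")))
```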

2015-10-14 0:37 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Tue, 13 Oct 2015 16:09:16 +0100
> Daniel Bünzli <daniel.buenzli at erratique.ch> wrote (under topic heading
> 'Counting Codepoints')
>
> > I don't understand why people still insist on programming with
> > Unicode at the encoding level rather than at the scalar value level.
> > Deal with encoding errors and sanitize your inputs at the IO boundary
> > of your program and then simply work with scalar values internally.
>
> If you are referring to indexing, I suspect the issue is performance.
> UTF-32 feels wasteful, and if the underlying character text is UTF-8 or
> UTF-16 we need an auxiliary array to convert character number to byte
> offset if we are to have O(1) time for access.
>
> This auxiliary array can be compressed chunk by chunk, but the larger
> the chunk, the greater the maximum access time.  The way it could work
> is a bit strange, because this auxiliary array is redundant.  For
> example, you could use it to record the location of every 4th or every
> 5th codepoint so as to store UTF-8 offset variation in 4 bits, or every
> 15th codepoint for UTF-16.  Access could proceed by looking up the
> index for the relevant chunk, then adding up nibbles to find the
> relevant recorded location within the chunk, and then use the basic
> character storage itself to finally reach the intermediate points.
>
> (I doubt this is an original idea, but I couldn't find it expressed
> anywhere.  It probably performs horribly for short strings.)
>
> Perhaps you are merely suggesting that people work with a character
> iterator, or in C refrain from doing integer arithmetic on pointers
> into strings.
>
> Richard.
>
>
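
For reference, the sampled index Richard describes above could look
roughly like the following Python sketch. All names (STRIDE,
SAMPLES_PER_CHUNK, build_index, byte_offset) are made up for the
illustration, and it assumes valid UTF-8 and a valid code point number. It
records every 5th code point: five code points span 5 to 20 bytes, so the
span minus 5 fits in 4 bits. The nibbles are kept in a plain list here; a
real implementation would presumably pack two per byte.

```python
STRIDE = 5              # code points between recorded positions
SAMPLES_PER_CHUNK = 16  # bounds how many nibbles one lookup has to add up

def build_index(buf):
    """Return (chunk_offsets, nibbles) for a bytes object of valid UTF-8."""
    # Offsets of all lead bytes; a temporary list, acceptable for a sketch.
    starts = [i for i, b in enumerate(buf) if b & 0xC0 != 0x80]
    samples = starts[::STRIDE]
    chunk_offsets = samples[::SAMPLES_PER_CHUNK]
    # Bytes spanned by STRIDE code points, minus STRIDE: always 0..15.
    nibbles = [b - a - STRIDE for a, b in zip(samples, samples[1:])]
    return chunk_offsets, nibbles

def _advance(buf, pos, count):
    """Skip count code points forward from byte offset pos."""
    for _ in range(count):
        pos += 1
        while pos < len(buf) and buf[pos] & 0xC0 == 0x80:
            pos += 1
    return pos

def byte_offset(buf, index, i):
    """Byte offset of code point number i (0-based)."""
    chunk_offsets, nibbles = index
    s = i // STRIDE                   # last recorded sample at or before i
    c = s // SAMPLES_PER_CHUNK        # chunk containing that sample
    first = c * SAMPLES_PER_CHUNK
    pos = chunk_offsets[c] + sum(nibbles[first:s]) + STRIDE * (s - first)
    # Walk the character storage itself for the remaining code points.
    return _advance(buf, pos, i % STRIDE)

# Usage
text = "a\u00e9\u20ac\U0001F600" * 50
buf = text.encode("utf-8")
index = build_index(buf)
offset_of_cp_123 = byte_offset(buf, index, 123)
```

Each lookup adds up at most SAMPLES_PER_CHUNK - 1 nibbles and then steps
over at most STRIDE - 1 code points, so access time is bounded by the
chunk size rather than by the string length.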