Unicode String Models

Philippe Verdy via Unicode unicode at unicode.org
Sun Sep 9 09:10:26 CDT 2018


On Sun, Sep 9, 2018 at 10:10, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>

I fully agree, as long as the "middle" character is **approximated** by the
middle of the **encoded** length.

But if it has to be the exact middle (by code point number), you have to
count the code points exactly by parsing the whole string, which is O(n),
then compute the middle from that count and parse again from the beginning
to locate the encoded position of that code point index, which is O(n/2):
the final cost is O(3n/2), i.e. still O(n).
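
To make that cost concrete, here is a minimal Rust sketch of the two-pass
lookup (the function name is mine, purely illustrative): the first pass
counts the code points, the second locates the byte offset of the middle
one.

    // Two-pass exact-middle lookup in a UTF-8 string: O(n) count,
    // then O(n/2) rescan to find the byte offset of code point n/2.
    fn middle_codepoint_offset(s: &str) -> Option<usize> {
        let count = s.chars().count();            // pass 1: O(n)
        if count == 0 {
            return None;
        }
        s.char_indices()                          // pass 2: O(n/2)
            .nth(count / 2)
            .map(|(byte_offset, _)| byte_offset)
    }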

The trick of using a "small amount" of memory only avoids that second pass,
bringing the total back to O(n). You get O(1)* lookups only if you keep
that "small memory" around to map code point indexes to byte offsets. But
the claim that it is "small" does not hold if the string is large (big
value of n), and it is of no benefit if the string is indexed only once.
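
As a hedged sketch of that trick, one can record the byte offset of every
K-th code point during a single O(n) scan; later lookups then cost O(K)
instead of O(n). The structure and names below are hypothetical, not taken
from any implementation discussed here; note that the table holds n/K
entries, so it does grow with the string, as argued above.

    struct CheckpointIndex {
        stride: usize,       // K: one checkpoint every K code points
        offsets: Vec<usize>, // offsets[i] = byte offset of code point i*K
    }

    impl CheckpointIndex {
        // One O(n) scan records the checkpoints.
        fn build(s: &str, stride: usize) -> Self {
            let mut offsets = Vec::new();
            for (cp, (byte_offset, _)) in s.char_indices().enumerate() {
                if cp % stride == 0 {
                    offsets.push(byte_offset);
                }
            }
            CheckpointIndex { stride, offsets }
        }

        // Byte offset of code point `cp`: jump to the nearest prior
        // checkpoint, then scan at most `stride` code points forward.
        fn byte_offset(&self, s: &str, cp: usize) -> Option<usize> {
            let start = *self.offsets.get(cp / self.stride)?;
            s[start..]
                .char_indices()
                .nth(cp % self.stride)
                .map(|(off, _)| start + off)
        }
    }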

In practice, we build that "small memory" while instantiating an iterator
that processes the whole string. The string may not be fully loaded in
memory, in which case the index will need reallocation as we read it; we
need not keep the text itself in memory if it is a very long file, but the
index buffer must remain in memory even if we never come back to the start
of the string. That "small memory" is just a local helper, and its cost
must be evaluated. In practice, however, long texts come from I/O. If the
text comes from a file, you benefit from the OS filesystem cache to save
I/O; if it comes from the network, you'll need to store the data in a
local temporary file if you don't want to keep it fully in memory, letting
the OS page parts of it out. Emacs only works with files: network texts are
necessarily backed by at least a local temporary file.

So that "small memory" for the index is not even needed (but Emacs
maintains an index in memory only to locate line numbers. It has no need to
do that for column numbers, as it is just faster to rescan the line (and
extremely long lines of text are exceptional, these files are rarely edited
with Emacs, unless you use it to load a binary file, whose representation
on screen will be very different, notably for controls, which are expanded
into another cached form: the column index for display, which is different
from the code point index and specific to the Emacs representation for
display/editing, is built only line by line, separately from the line index
kept for the whole edited file; it is also independant of the effective
encoding: it would still be needed even if the encoding of the backing
buffer was UTF-32 with only 1 codepoint per code unit, becase the actual
display will still expand the code points to other forms using visible
escaping mechanisms, and it is even needed when the file is pure 7-bit
ASCII, and kept with one byte per code point: choosing the Unicode encoding
forms has no impact at all to what is really needed for display in text
editors).
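
A minimal sketch of that line-index design, assuming nothing about Emacs's
actual C internals: keep only the byte offsets of line starts, and
recompute a column by rescanning its (usually short) line.

    // Byte offsets at which each line begins; O(n) to build,
    // one entry per line rather than per code point.
    fn line_starts(text: &str) -> Vec<usize> {
        let mut starts = vec![0];
        for (i, b) in text.bytes().enumerate() {
            if b == b'\n' {
                starts.push(i + 1); // next line begins after the newline
            }
        }
        starts
    }

    // Column (in code points) at byte offset `pos`, found by rescanning
    // from the start of its line: O(line length), not O(file length).
    fn column_of(text: &str, starts: &[usize], pos: usize) -> usize {
        let line = match starts.binary_search(&pos) {
            Ok(i) => i,
            Err(i) => i - 1,
        };
        text[starts[line]..pos].chars().count()
    }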

Text editors always use various indexing caches to manage memory and I/O
and to allow working on large texts even on systems with little memory
available. As much as possible they rely on the OS-level filesystem
caches. And in all cases, they don't work directly on their text buffer:
its internal representation in the backing store is not a single string
but a structured collection of buffers, built on top of an interface
masking the details. The effective text is then re-encoded and saved from
that object using complex serialization schemes; the text buffer is
"virtualized".

Only very basic text editors (such as Notepad) use a single native text
buffer, and they are very slow when editing very large files, because they
constantly need to copy/move large blocks of memory to perform insertions
and deletions, and they put heavy pressure on the memory allocator. Even
vi(m) or (s)ed on Unix/Linux now use another internal encoded form with a
backing store in temporary files, created automatically as soon as you
start modifying the content. The final consolidation and serialization
occurs only when saving the result.
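
For contrast, the usual fix for that flat-buffer cost is a gap buffer,
which is what Emacs itself is well known to use internally: the free space
is kept at the edit point, so an insertion at the cursor is amortized
O(1), and only cursor motion shifts bytes. A minimal sketch, with growth
and bounds checks elided:

    struct GapBuffer {
        buf: Vec<u8>,
        gap_start: usize, // first byte of the gap (== cursor position)
        gap_end: usize,   // one past the last byte of the gap
    }

    impl GapBuffer {
        fn with_capacity(cap: usize) -> Self {
            GapBuffer { buf: vec![0; cap], gap_start: 0, gap_end: cap }
        }

        // Moving the cursor shifts only the bytes between the old and
        // new positions across the gap, not everything after the edit
        // point as a single flat buffer must.
        fn move_cursor(&mut self, pos: usize) {
            while self.gap_start > pos {
                self.gap_start -= 1;
                self.gap_end -= 1;
                self.buf[self.gap_end] = self.buf[self.gap_start];
            }
            while self.gap_start < pos {
                self.buf[self.gap_start] = self.buf[self.gap_end];
                self.gap_start += 1;
                self.gap_end += 1;
            }
        }

        // Insert at the cursor: just fill one byte of the gap.
        fn insert(&mut self, byte: u8) {
            assert!(self.gap_start < self.gap_end, "growing the gap elided");
            self.buf[self.gap_start] = byte;
            self.gap_start += 1;
        }
    }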