Unicode String Models

Eli Zaretskii via Unicode unicode at unicode.org
Sun Sep 9 10:53:12 CDT 2018

> Date: Sun, 9 Sep 2018 16:10:26 +0200
> Cc: unicode Unicode Discussion <unicode at unicode.org>
> From: Philippe Verdy via Unicode <unicode at unicode.org>
> In practice, we prepare the "small memory" while instantiating a new iterator that will
> process the whole string (which may not be fully loaded in memory, in which case that "small memory" will
> need reallocation as we read the whole string, though we need not keep the string in memory if it's a very long
> text file: the index buffer will still remain in memory even if we no longer need to come back to the start of the
> string). That "small memory" is just a local helper; its cost must be evaluated. In practice, however, long texts
> come from I/O: the text will come from files, in which case you'll benefit from the filesystem cache
> of the OS to save I/O, or from the network (in which case you'll need to store the network data in a local
> temporary file if you don't want to keep it fully in memory and allow some parts to be paged out of memory by
> the OS). But in Emacs, it only works with files: network texts are necessarily backed at least by a local
> temporary file.

Emacs maintains caches for byte to character conversions for both
strings and buffers.  The cache holds data only for the last string
and separately the last buffer where Emacs needed to convert character
counts to byte counts or vice versa.  For buffers, there are 4 places
that are maintained for every buffer at all times, for which both the
character and byte positions are known, and Emacs uses those whenever
it needs to do conversions for a buffer that is not the cached one.
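
To illustrate the general idea (this is only a rough sketch, not
Emacs's actual code), here is how a character-to-byte conversion over
UTF-8 text can start from the nearest position whose character and
byte offsets are both known, and remember its last answer.  The class
name PositionIndex, its methods, and the choice of just two anchors
(start and end of text) are assumptions made for this example.

```python
class PositionIndex:
    """Map character positions to byte offsets in UTF-8 text (illustrative)."""

    def __init__(self, data: bytes):
        self.data = data
        nchars = len(data.decode("utf-8"))
        # Positions for which both offsets are known up front.
        # Assumption for this sketch: only the start and the end of the text.
        self.anchors = [(0, 0), (nchars, len(data))]
        self.cached = None  # most recent (charpos, bytepos) answer

    def char_to_byte(self, charpos: int) -> int:
        if not 0 <= charpos <= self.anchors[-1][0]:
            raise IndexError(charpos)
        candidates = list(self.anchors)
        if self.cached is not None:
            candidates.append(self.cached)
        # Start scanning from the known position closest to the target.
        c, b = min(candidates, key=lambda p: abs(p[0] - charpos))
        while c < charpos:
            # Step forward one character: skip the lead byte, then any
            # continuation bytes (those of the form 10xxxxxx).
            b += 1
            while b < len(self.data) and (self.data[b] & 0xC0) == 0x80:
                b += 1
            c += 1
        while c > charpos:
            # Step backward one character: back up to the previous lead byte.
            b -= 1
            while (self.data[b] & 0xC0) == 0x80:
                b -= 1
            c -= 1
        self.cached = (charpos, b)
        return b
```

With more known positions the scan gets shorter, which is the point of
keeping several of them per buffer in addition to the cache.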

> So that "small memory" for the index is not even needed (but Emacs maintains an index in memory only to
> locate line numbers).

That's a different cache, unrelated to what Richard was alluding to
(and I think unrelated to the current discussion).

> Text editors use various indexing caches always, to manage memory, I/O, and allow working on large texts
> even on systems with low memory available. As much as possible they attempt to use the OS-level caches
> of the filesystem. And in all cases, they don't work directly on their text buffer (whose internal representation in
> their backing store is not just a single string, but a structured collection of buffers, built on top of an interface
> masking the details: the effective text will then be reencoded and saved from that object, using complex
> serialization schemes; the text buffer is "virtualized").

In Emacs, buffer text is a character string with a gap, actually.
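
For readers unfamiliar with the technique, a gap buffer can be sketched
roughly as below.  This is a generic illustration, not Emacs's
implementation: the class GapBuffer and its method names are made up
for the example, and it stores Python characters, whereas a real
editor would store bytes in its internal encoding.

```python
class GapBuffer:
    """Text kept in one array with a movable gap (illustrative sketch)."""

    def __init__(self, text: str = "", gap_size: int = 16):
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)      # first unused slot
        self.gap_end = len(self.buf)    # first used slot after the gap

    def __len__(self) -> int:
        return len(self.buf) - (self.gap_end - self.gap_start)

    def _move_gap(self, pos: int) -> None:
        # Shift characters across the gap until the gap starts at pos.
        while self.gap_start > pos:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def _grow(self, needed: int) -> None:
        # Enlarge the gap in place; the text after the gap moves right.
        extra = max(needed, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = [None] * extra
        self.gap_end += extra

    def insert(self, pos: int, text: str) -> None:
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        if self.gap_end - self.gap_start < len(text):
            self._grow(len(text))
        for ch in text:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos: int, count: int) -> None:
        if not 0 <= pos <= len(self) or count < 0:
            raise IndexError(pos)
        self._move_gap(pos)
        # Deleting just widens the gap over the following characters.
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def text(self) -> str:
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

Insertions and deletions near the gap are cheap; the cost is paid when
the gap has to be moved to a distant position.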
