Unicode String Models

Philippe Verdy via Unicode unicode at unicode.org
Sun Sep 9 12:35:47 CDT 2018


On Sun, 9 Sep 2018 at 17:53, Eli Zaretskii <eliz at gnu.org> wrote:

> > Text editors use various indexing caches always, to manage memory, I/O,
> > and allow working on large texts even on systems with low memory
> > available. As much as possible they attempt to use the OS-level caches
> > of the filesystem. And in all cases, they don't work directly on their
> > text buffer (whose internal representation in their backing store is
> > not just a single string, but a structured collection of buffers, built
> > on top of an interface masking the details: the effective text will
> > then be reencoded and saved from that object, using complex
> > serialization schemes; the text buffer is "virtualized").
>
> In Emacs, buffer text is a character string with a gap, actually.
>

A text buffer with gaps is a complex structure, not just a plain string.
Gaps are one way to manage memory more efficiently and to get reasonable
performance when editing, without having to constantly move large blocks.
Such a "string" with gaps may actually be just a byte buffer used as a
backing store, but that buffer alone does not represent only the text
currently being edited: a process still has to serialize it and clean it up
before the buffer can be used to save the edited text. Emacs may not
necessarily deallocate the end of the buffer, but I doubt it constantly
keeps a single gap at the end: insertions and deletions in the middle would
then constantly move large blocks and use excessive CPU and memory
bandwidth, with very slow response. Users do not want to see what they type
appear on the screen at one keystroke every few seconds because each typed
key causes massive block moves and excessive memory paging from/to disk
while the move is being performed.
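
As an illustration of why a gap avoids large block moves, here is a minimal
sketch of a byte buffer with one movable gap. It is not Emacs code: the
class GapBuffer, its gap_size parameter and the serialize() method are all
hypothetical names chosen for this example. The gap is slid to the edit
position, so repeated typing at the same place only writes into the gap,
and serialize() is the cleanup step that drops the gap before saving.

```python
class GapBuffer:
    """Byte buffer with one movable gap (hypothetical sketch, not Emacs code)."""

    def __init__(self, text: bytes = b"", gap_size: int = 16):
        self.buf = bytearray(text) + bytearray(gap_size)
        self.gap_start = len(text)    # first unused byte
        self.gap_end = len(self.buf)  # one past the last unused byte

    def __len__(self) -> int:
        # Logical text length: everything except the gap.
        return len(self.buf) - (self.gap_end - self.gap_start)

    def _move_gap(self, pos: int) -> None:
        """Slide the gap so that it starts at logical position pos."""
        if pos < self.gap_start:
            # Move the bytes between pos and the gap to the far side of the gap.
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            # Move the bytes just after the gap to the near side of the gap.
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def _grow(self, needed: int) -> None:
        """Enlarge the gap by at least `needed` bytes (roughly doubling)."""
        extra = max(needed, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = bytearray(extra)
        self.gap_end += extra

    def insert(self, pos: int, data: bytes) -> None:
        self._move_gap(pos)
        if len(data) > self.gap_end - self.gap_start:
            self._grow(len(data))
        self.buf[self.gap_start:self.gap_start + len(data)] = data
        self.gap_start += len(data)

    def delete(self, pos: int, count: int) -> None:
        # Deleting just widens the gap over the removed bytes.
        self._move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def serialize(self) -> bytes:
        """Return the text without the gap, e.g. before saving."""
        return bytes(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

In this sketch the cost of an edit is proportional to the distance the gap
has to travel, not to the size of the whole text, which is the point made
above about avoiding a gap that always stays at the end.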

All editors I have seen treat the text as an ordered collection of small
buffers (these small buffers may still have small gaps), which are
occasionally merged or split when needed (merging does not cause any
reallocation but may free one of the buffers), some of them being paged out
to temporary files when memory is under pressure. The editor's code uses
some heuristics to decide when maintenance of the collection is really
needed and useful for performance.
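
The ordered collection of small buffers can be sketched as follows. This is
only an illustration under simplifying assumptions: the class ChunkedText,
the MAX_CHUNK limit and the merge rule are hypothetical, paging to temporary
files is left out, and a Python bytearray may reallocate on merge where a
fixed-capacity buffer in a real editor would not.

```python
class ChunkedText:
    """Text stored as an ordered list of small byte chunks (illustrative sketch)."""

    MAX_CHUNK = 8  # deliberately tiny; a real editor would use kilobytes

    def __init__(self, text: bytes = b""):
        step = self.MAX_CHUNK
        self.chunks = [bytearray(text[i:i + step])
                       for i in range(0, len(text), step)]
        if not self.chunks:
            self.chunks = [bytearray()]

    def _locate(self, pos: int):
        """Map a logical position to (chunk index, offset inside that chunk)."""
        last = len(self.chunks) - 1
        for i, chunk in enumerate(self.chunks):
            if pos < len(chunk) or i == last:
                if pos > len(chunk):
                    raise IndexError("position past end of text")
                return i, pos
            pos -= len(chunk)

    def insert(self, pos: int, data: bytes) -> None:
        i, off = self._locate(pos)
        chunk = self.chunks[i]
        chunk[off:off] = data  # only this small chunk is shifted
        if len(chunk) > self.MAX_CHUNK:
            # Split the oversized chunk into pieces of at most MAX_CHUNK bytes.
            step = self.MAX_CHUNK
            self.chunks[i:i + 1] = [chunk[j:j + step]
                                    for j in range(0, len(chunk), step)]

    def delete(self, pos: int, count: int) -> None:
        while count > 0:
            i, off = self._locate(pos)
            chunk = self.chunks[i]
            n = min(count, len(chunk) - off)
            if n == 0:
                raise IndexError("deletion runs past end of text")
            del chunk[off:off + n]
            count -= n
            self._merge(i)

    def _merge(self, i: int) -> None:
        """Fold chunk i into a neighbour when both fit in one chunk."""
        if (i + 1 < len(self.chunks)
                and len(self.chunks[i]) + len(self.chunks[i + 1]) <= self.MAX_CHUNK):
            nxt = self.chunks.pop(i + 1)  # the popped buffer is freed
            self.chunks[i].extend(nxt)
        if (i > 0
                and len(self.chunks[i - 1]) + len(self.chunks[i]) <= self.MAX_CHUNK):
            cur = self.chunks.pop(i)
            self.chunks[i - 1].extend(cur)

    def serialize(self) -> bytes:
        """Concatenate all chunks in order, e.g. before saving."""
        return b"".join(self.chunks)
```

The merge rule used here (merge whenever two neighbours fit in one chunk) is
just one possible heuristic; an editor is free to delay such maintenance
until it is actually useful.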

But besides this, the performance cost of UTF indexing of the code points
is invisible: each buffer only needs to avoid breaking the text in the
middle of a code point (that is, to break only at code point boundaries),
if the current encoding of the edited text is a UTF. An editor may also
avoid breaking buffers in the middle of clusters if it renders clusters
(including ligatures if they are supported): clusters are still small in
size in every encoding, and reasonable buffer sizes can hold at least
hundreds of clusters (even the largest ones, which occur rarely). How
editors manage clusters to make them editable depends on the
implementation, but even the UTF or code point boundaries are not enough
to handle that. In all cases the logical text buffer is structured with a
complex backing store, where parts may be paged out (and which also
includes more than just the current text: notably it includes parts of the
indexes, possibly in another temporary working file).
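
To show how cheap it is to keep buffer breaks on code point boundaries,
here is a small sketch for UTF-8. The function names utf8_boundary and
split_utf8 are hypothetical, and the code assumes well-formed UTF-8 and a
maximum buffer size of at least 4 bytes. It relies only on the fact that
UTF-8 continuation bytes have the bit pattern 10xxxxxx, so a proposed break
position has to back up by at most 3 bytes.

```python
def utf8_boundary(data: bytes, pos: int) -> int:
    """Return the largest index <= pos that is not inside a UTF-8 sequence."""
    pos = min(pos, len(data))
    # Continuation bytes look like 10xxxxxx; step back until we leave them.
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def split_utf8(data: bytes, max_len: int) -> list:
    """Split data into chunks of at most max_len bytes at code point boundaries.

    Assumes well-formed UTF-8 and max_len >= 4 (the longest UTF-8 sequence).
    """
    chunks = []
    start = 0
    while start < len(data):
        end = utf8_boundary(data, start + max_len)
        if end <= start:
            # Only reachable with malformed input: force progress anyway.
            end = min(start + max_len, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks
```

Each chunk produced this way decodes on its own as valid UTF-8. It does not
keep clusters together: a base letter and a following combining mark may
still land in different chunks, which is why code point boundaries alone
are not enough for cluster-aware editing.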