Unicode String Models

Eli Zaretskii via Unicode unicode at unicode.org
Sun Sep 9 14:20:16 CDT 2018


> From: Philippe Verdy <verdy_p at wanadoo.fr>
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham <richard.wordingham at ntlworld.com>, 
> 	unicode Unicode Discussion <unicode at unicode.org>
> 
> > In Emacs, buffer text is a character string with a gap, actually.
> 
> A text buffer with gaps is a complex structure, not just a plain string.

The difference is very small, and a couple of macros allow you to
almost forget about the gap.
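
To illustrate what such an accessor has to do, here is a minimal
sketch in Python.  It is not Emacs's actual C macros; the names
char_at, store, gap_start and gap_end are hypothetical and chosen
only for this example.  A logical position is translated into an
index in the backing store by skipping over the gap:

```python
def char_at(store, gap_start, gap_end, pos):
    """Return the character at logical position pos, skipping the gap."""
    # Positions before the gap map directly onto the backing store;
    # positions at or after the gap are shifted by the gap's size.
    if pos < gap_start:
        return store[pos]
    return store[pos + (gap_end - gap_start)]


# Backing store: "Hello", three unused gap slots, then "World".
store = list("Hello") + [None] * 3 + list("World")
text = "".join(char_at(store, 5, 8, i) for i in range(10))
print(text)  # prints "HelloWorld"
```

Code that goes through an accessor like this sees one contiguous
string and never needs to know where the gap currently is.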

> I doubt it constantly uses a single gap at the end (insertions and deletions in the middle would
> constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users
> do not want to see what they type appear on the screen at one keystroke every few seconds because each
> typed key causes massive block moves and excessive memory paging to and from disk while the move is being
> performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.
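
The general gap-buffer technique can be sketched as follows.  This is
a simplified Python illustration under my own assumptions, not the
Emacs implementation; the class GapBuffer and its methods are
hypothetical names.  The gap is moved to the edit position first, so
a run of consecutive insertions or deletions at the same place copies
no text at all:

```python
class GapBuffer:
    """A list of characters with one movable gap of unused slots."""

    def __init__(self, text="", gap_size=16):
        self.store = list(text) + [None] * gap_size
        self.gap_start = len(text)       # first unused slot
        self.gap_end = len(self.store)   # one past the last unused slot

    def move_gap(self, pos):
        """Slide the gap so that it starts at logical position pos."""
        while self.gap_start > pos:      # gap moves left, text moves right
            self.gap_start -= 1
            self.gap_end -= 1
            self.store[self.gap_end] = self.store[self.gap_start]
        while self.gap_start < pos:      # gap moves right, text moves left
            self.store[self.gap_start] = self.store[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def _grow(self, extra=16):
        # Enlarge an exhausted gap by adding unused slots at its end.
        self.store[self.gap_end:self.gap_end] = [None] * extra
        self.gap_end += extra

    def insert(self, pos, s):
        self.move_gap(pos)
        for ch in s:
            if self.gap_start == self.gap_end:
                self._grow()
            self.store[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        # Deleting just widens the gap over the characters that follow it.
        self.move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.store))

    def text(self):
        return "".join(self.store[:self.gap_start] + self.store[self.gap_end:])


buf = GapBuffer("Hello World", gap_size=4)
buf.insert(5, ",")       # the gap moves from the end to position 5
buf.insert(6, " big")    # the gap is already here: nothing is moved
print(buf.text())        # prints "Hello, big World"
buf.delete(0, 7)
print(buf.text())        # prints "big World"
```

Text is copied only when the edit position changes, and then only the
characters between the old and the new gap position are moved.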

> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have
> small gaps), which are occasionally merged or split when needed (merging does not cause any
> reallocation but may free one of the buffers), some of them being paged out to temporary files when memory is
> stressed. There are some heuristics in the editor's code to decide when maintenance of the collection is really
> needed and useful for performance.

My point was to say that Emacs is not one of these editors you
describe.

> But besides this, the performance cost of UTF indexing of the codepoints is invisible: each buffer only needs
> to avoid breaking text anywhere other than at codepoint boundaries, if the current encoding of the edited text is a UTF. An
> editor may also avoid breaking buffers in the middle of clusters if it renders clusters (including ligatures if
> they are supported): clusters are still small in size in every encoding, and reasonable buffer sizes can hold at
> least hundreds of clusters (even the largest ones, which occur rarely). How editors manage clusters to
> make them editable depends on the implementation, but even the UTF or codepoint boundaries are not
> enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where
> parts may be paged out (and which will also include more than just the current text, notably parts of the
> indexes, possibly in another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers.  That is when the encoding stops being "invisible".
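
To show the kind of problem raw bytes pose, here is a small Python
sketch.  It uses Python's own "surrogateescape" error handler, which
is a different mechanism from whatever an editor uses internally and
serves here only as an illustration.  Bytes that are not valid UTF-8
cannot be decoded strictly, so they need some special representation
in the decoded text if the file is to round-trip unchanged:

```python
# Valid UTF-8 for "café ", followed by two stray bytes that are not UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe"

try:
    raw.decode("utf-8")          # strict decoding rejects the stray bytes
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, each undecodable byte 0xNN becomes the lone
# surrogate U+DCNN, so the original bytes can be recovered on encoding.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)            # prints False
print(len(text))            # prints 7
print(round_trip == raw)    # prints True
```

Whatever representation is chosen, code that indexes or moves through
the text has to know about it.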

