Unicode String Models

Sun Sep 9 14:20:16 CDT 2018

> From: Philippe Verdy <verdy_p at wanadoo.fr>
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham <richard.wordingham at ntlworld.com>, 
> 	unicode Unicode Discussion <unicode at unicode.org>
> 
>  In Emacs, buffer text is a character string with a gap, actually.
> 
> A text buffer with gaps is a complex structure, not just a plain string.

The difference is very small, and a couple of macros allow you to
almost forget about the gap.
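
To illustrate why the difference is small, here is a minimal sketch in
Python of the kind of position mapping such macros perform.  It is not
Emacs's actual C code; the names physical_index, char_at, storage,
gap_start and gap_size are hypothetical, chosen only for this example.

```python
def physical_index(pos, gap_start, gap_size):
    """Map a logical text position to an index in the underlying array."""
    # Positions before the gap map to themselves; positions at or after
    # the gap are shifted past it.
    return pos if pos < gap_start else pos + gap_size

# "hello world" stored with a 4-slot gap right after "hello".
storage = list("hello") + [None] * 4 + list(" world")
gap_start, gap_size = 5, 4

def char_at(pos):
    """Read the character at a logical position, ignoring the gap."""
    return storage[physical_index(pos, gap_start, gap_size)]

# Callers see one contiguous string and never touch the gap directly.
text = "".join(char_at(i) for i in range(len(storage) - gap_size))
```

Once every access goes through a mapping like this, the rest of the
code can treat the buffer as if it were a plain string.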

> I doubt it constantly uses a single gap at end (insertions and deletions in the middle would
> constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users
> do not want to see what they type appearing on the screen at one keystroke every few seconds because each
> typed key causes massive block moves and excessive memory paging from/to disk while this move is being
> performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.
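
A rough sketch of that behaviour, in Python rather than Emacs's C, and
under the assumption of a simple list-backed store: the class GapBuffer
and its methods are hypothetical names for illustration only.  Moving
the gap copies only the characters between its old and new positions,
so consecutive edits at the same place move nothing at all.

```python
class GapBuffer:
    """Toy gap buffer: text lives in buf, split around [gap_start, gap_end)."""

    def __init__(self, text="", gap_size=16):
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)
        self.gap_end = len(self.buf)  # exclusive

    def _move_gap(self, pos):
        # Shift only the characters between the old and new gap positions.
        while self.gap_start > pos:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def _grow(self, extra):
        # Enlarge the gap in place by inserting empty slots at its end.
        self.buf[self.gap_end:self.gap_end] = [None] * extra
        self.gap_end += extra

    def insert(self, pos, s):
        self._move_gap(pos)
        if self.gap_end - self.gap_start < len(s):
            self._grow(len(s) + 16)
        for ch in s:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def delete(self, pos, count):
        # Deleted characters are simply absorbed into the gap.
        self._move_gap(pos)
        self.gap_end = min(self.gap_end + count, len(self.buf))

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])


b = GapBuffer("hello world", gap_size=4)
b.insert(5, ",")     # gap moves from the end to position 5
b.insert(6, " big")  # gap is already here: no block move needed
```

The cost of a block move is paid only when the editing position jumps,
and it is proportional to the distance jumped, not to the buffer size.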

> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have
> small gaps), which are occasionally merged or split when needed (merging does not cause any
> reallocation but may free one of the buffers), some of them being paged out to temporary files when memory is
> stressed. There are some heuristics in the editor's code to decide when maintenance of the collection is really
> needed and useful for the performance.

My point was to say that Emacs is not one of these editors you
describe.

> But beside this the performance cost of UTF indexing of the codepoints is invisible: each buffer will only need
> to avoid breaking text between codepoint boundaries, if the current encoding of the edited text is a UTF. An
> editor may also avoid breaking buffers in the middle of clusters if they render clusters (including ligatures if
> they are supported): clusters are still small in size in every encoding and reasonable buffer sizes can hold at
> least hundreds of clusters (even the largest ones which occur rarely). How editors will manage clusters to
> make them editable is dependent on the implementation, but even the UTF or codepoints boundaries are not
> enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where
> parts may be paged out (and will also include more than just the current text, notably it will include parts of the
> indexes, possibly in another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers.  That is when the encoding stops being "invisible".
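
As an illustration of the raw-byte problem, here is a small Python
sketch.  It uses Python's own "surrogateescape" error handler, which is
an analogous mechanism and not how Emacs represents raw bytes
internally; the helper is_raw_byte is a hypothetical name for this
example.

```python
# A byte sequence that is mostly UTF-8 but contains two invalid bytes.
raw = b"caf\xc3\xa9 \xff\xfe end"

# A strict decode would raise an error here; surrogateescape instead maps
# each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

def is_raw_byte(ch):
    """True if ch stands for an undecodable byte rather than a character."""
    return 0xDC80 <= ord(ch) <= 0xDCFF

raw_positions = [i for i, ch in enumerate(text) if is_raw_byte(ch)]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

Any code that indexes, moves over, or re-encodes such text has to know
which elements are real characters and which are stand-ins for bytes,
so the encoding can no longer be treated as an invisible detail.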