Bidi paragraph direction in terminal emulators

Richard Wordingham via Unicode unicode at
Sat Feb 9 08:06:48 CST 2019

On Sat, 09 Feb 2019 09:42:09 +0200
Eli Zaretskii via Unicode <unicode at> wrote:

> > Date: Sat, 9 Feb 2019 00:18:14 +0000
> > From: Richard Wordingham via Unicode <unicode at>
> >   
> > > For character composition, you must have a shaping engine to talk
> > > to, and the shaper should tell you the width of each grapheme
> > > cluster it returns.  
> > 
> > (a) What defines the grapheme clusters?  The definition might be
> > terminal-specific.  
> Well, the "you" above alluded to the terminal emulator, of course.
> The grapheme clusters are determined by the shaping engine that the
> emulator must call when appropriate (or always).

I find it very hard to believe that that is how it works with GNOME
Terminal (Version 3.18.3, using VTE Version 0.42.5).  At the command
line I typed in the Khmer script string ក្កេក (KA, COENG, KA, SIGN E,
KA), and saw the string split into four columns (KA, COENG), (KA),
(SIGN E), (KA), with each column given the same width. When written
correctly, SIGN E is first in visual order.  The fourth column was
displayed on top of the third column, which contained a dotted circle
to show that SIGN E on its own was not grammatically correct.  If I
were writing a Khmer font for use with GNOME Terminal, I would attempt
to ensure that the display for SIGN E fitted in a single cell.
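
A minimal sketch of a per-character allocation rule that reproduces the
four-column split described above.  This is an assumption made for
illustration, not a statement of VTE's actual rules: the function name
naive_cells is hypothetical, and the rule (non-spacing and enclosing
marks join the previous cell, everything else starts a new cell) is
only a guess consistent with the observed output.

```python
import unicodedata

# KA, COENG, KA, SIGN E, KA (the Khmer string from the example above)
SAMPLE = "\u1780\u17D2\u1780\u17C1\u1780"

def naive_cells(text):
    """Group characters into cells using only General_Category.

    Hypothetical rule: Mn/Me marks attach to the preceding cell;
    any other character (including spacing marks, Mc) opens a new cell.
    """
    cells = []
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me") and cells:
            cells[-1] += ch      # zero-width mark: stays in the current cell
        else:
            cells.append(ch)     # base letter or spacing mark: new cell
    return cells

if __name__ == "__main__":
    for cell in naive_cells(SAMPLE):
        print([unicodedata.name(c) for c in cell])
```

Under this rule COENG (category Mn) stays with the first KA, while SIGN E
(category Mc) gets a cell of its own after the second KA, which gives the
grouping (KA, COENG), (KA), (SIGN E), (KA).  A rule of this kind knows
nothing about Khmer reordering, so it cannot put SIGN E first.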

Of course, the renderer's grapheme cluster boundaries don't always
match appearances.  To get the traditional placement of U+1A58 TAI THAM
SIGN MAI KANG LAI, I end up with it being a mark glyph one cluster
later than HarfBuzz indicates it to be.

It would be good to be able to access a maintained statement of the
VTE rules for allocating characters to a cell, or group of cells, as

> > (b) With a terminal that expects a fixed width font, surely the
> > terminal decides how many cells it allocates to a group of
> > characters, and the font designer has to come up with a suitable
> > value based on that.   
> Yes.  A terminal emulator that works with a shaper should probably
> post-process the width information returned by the shaper for these
> purposes.

Perhaps it should base the number of cells on the width of the
clusters.  However, continuing with my example, U+1789 KHMER LETTER NYO
as a base character is too wide to fit in a cell, and the next
character will overwrite its right-hand part. From this I deduce that it
is allocated just one cell.  GNOME Terminal is not alone in doing this,
but it does better than some, in my opinion, in that the overflow of the
foreground of one cell is not obliterated by the background of the
next cell.  U+1789 has an East Asian width property of 'Neutral', which
is distinctly unhelpful.
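
To illustrate why 'Neutral' is unhelpful, here is a sketch of a width
rule driven only by the East Asian Width property.  The function name
eaw_cells and the rule itself (Wide and Fullwidth take two cells,
everything else one) are assumptions for illustration; real wcwidth
implementations and terminals differ in detail.

```python
import unicodedata

def eaw_cells(ch):
    """Hypothetical cell count based only on East_Asian_Width.

    'W' (Wide) and 'F' (Fullwidth) get two cells; all other values,
    including 'N' (Neutral), get one.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

if __name__ == "__main__":
    for ch in ("\u1789", "\u4E2D"):   # KHMER LETTER NYO, a CJK ideograph
        print(unicodedata.name(ch),
              unicodedata.east_asian_width(ch),
              eaw_cells(ch))
```

With such a rule U+1789 is given a single cell, however wide the font
draws it, because the property carries no information about Khmer
letter widths.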

What I would like is a specification of what a font must do to avoid
such problems.

> > >  I don't see how you can expect wcwidth, or any other
> > > interface that was designed to work with _characters_, to be
> > > useful when you need to display grapheme clusters.  

It, or something similar but worse, gets used, especially when moving
the cursor for editing.

> > Well I can envisage a decision being made that a grapheme cluster
> > str (as decreed by the terminal) shall occupy wcswidth(str) cells -
> > "The wcswidth() function returns the number of column positions for
> > the wide-character string s, truncated to at most length n".  
> AFAIU, the shaping engine returns its output in terms of font glyph
> numbers, not character codepoints, so you cannot in general call
> wcswidth on them.  The shaper also returns the advance information,
> which serves instead of wcwidth and related APIs for determining the
> actual width on display.
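
A sketch of the kind of post-processing suggested above, turning a
shaper's advance into a whole number of cells.  The function name, the
units and the round-up policy are all assumptions for illustration; no
particular shaper API or terminal behaviour is implied.

```python
import math

def cells_for_advance(advance, cell_width):
    """Hypothetical mapping from a cluster's advance to grid cells.

    advance and cell_width are in the same units (for example font
    units).  A zero or negative advance takes no cell; anything else
    is rounded up so the cluster never overflows into the next cell.
    """
    if advance <= 0:
        return 0
    return math.ceil(advance / cell_width)

if __name__ == "__main__":
    for advance in (0, 600, 900):
        print(advance, cells_for_advance(advance, 600))
```

Rounding up keeps a wide base character from being overwritten by its
neighbour, at the cost of visible gaps when the advance only slightly
exceeds a cell.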

Unfortunately, when the rectangular grid is being preserved,
typographical advance width is generally ignored when determining the
placement of characters.  Now, this is not always true; one can have
the situation where the positioning of characters respects the
advance widths, but the positioning of the cursor assumes a fixed-width
rectangular grid.  I have found working with that to be extremely

