Proposal for BiDi in terminal emulators

Thu Jan 31 17:17:19 CST 2019

On Thu, 31 Jan 2019 12:46:48 +0100
Egmont Koblinger <egmont at gmail.com> wrote:

> Hi Richard,
> 
> > Basic Arabic shaping, at the level of a typewriter, is
> > straightforward enough to leave to a terminal emulator, as Eli has
> > suggested.  
> 
> What is "basic" Arabic shaping exactly?

Just using initial, medial and final forms, with no vertical stacking.
In terms of glyphs, none of the presentation form glyphs with
'LIGATURE' in the name would be used.

> I can see problems with leaving it to a terminal. It's not aware of
> the neighboring character if the string is cropped.

Cropped why?  If the problem is the truncation of lines, one can simply
store the next character.

> It's not able to
> separate different UI elements that happen to be adjacent in the
> terminal, separated by different background color or such.

ZWJ and ZWNJ can handle that.
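
To make the last three answers concrete, here is a minimal sketch in
Python of typewriter-level shaping done on the terminal side.  It is
only an illustration under stated assumptions: the JOINING table is a
hypothetical five-letter subset, the names joins_forward,
joins_backward and shape are made up for the example, and transparent
characters such as combining marks are not handled.  The presentation
forms are looked up by their Unicode character names, so no 'LIGATURE'
form can be selected.

```python
import unicodedata

# Hypothetical, deliberately tiny joining table (an assumption for this
# sketch).  D = dual-joining, R = right-joining (joins only to the
# preceding letter).
JOINING = {
    "\u0628": ("BEH", "D"),
    "\u0644": ("LAM", "D"),
    "\u0645": ("MEEM", "D"),
    "\u0627": ("ALEF", "R"),
    "\u062F": ("DAL", "R"),
}
ZWJ, ZWNJ = "\u200D", "\u200C"

def joins_forward(ch):
    """True if ch can connect to the letter that follows it."""
    return ch == ZWJ or (ch in JOINING and JOINING[ch][1] == "D")

def joins_backward(ch):
    """True if ch can connect to the letter that precedes it."""
    return ch == ZWJ or ch in JOINING

FORMS = {
    (False, False): "ISOLATED",
    (True, False): "FINAL",
    (False, True): "INITIAL",
    (True, True): "MEDIAL",
}

def shape(text):
    """Replace known letters by initial/medial/final/isolated forms."""
    out = []
    for i, ch in enumerate(text):
        if ch not in JOINING:
            out.append(ch)  # ZWJ, ZWNJ and everything else pass through
            continue
        name, jtype = JOINING[ch]
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        before = joins_forward(prev)
        after = jtype == "D" and joins_backward(nxt)
        form = FORMS[(before, after)]
        out.append(unicodedata.lookup(f"ARABIC LETTER {name} {form} FORM"))
    return "".join(out)
```

Given BEH ALEF BEH, shape() yields initial BEH, final ALEF and
isolated BEH.  For a truncated line, the terminal shapes the stored
text including the one character beyond the visible edge and then
crops the result, so the last visible letter keeps its joining form.
A ZWNJ between two adjacent UI elements leaves the letters on either
side unjoined, and a ZWJ at an edge forces the joining form.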

> On the other hand, let's reverse the question:
> 
> "Basic Arabic shaping, at the level of a typewriter, is
> straightforward enough to be implemented in the application, using
> presentation form characters, as I suggest". Could you please point
> out the problems with this statement?

Apart from using presentation form characters in raw text being a sin?

If a general text-manipulating application, e.g. cat, grep or awk, is
writing to a file, it should not convert normal Arabic characters to
presentation forms.  You are now asking a general application to
determine whether it is writing to a terminal or not, and to alter its
output if it is writing to a terminal.  If the terminal window is
actually an Emacs text buffer, I would not want such output to be
converted.  It is entirely natural to save an Emacs text buffer to
a file.

> > I believe combining marks present issues even in implicit modes.  In
> > implicit mode, one cannot simply delegate the task to normal text
> > rendering, for one has to allocate text to cells.  There are a
> > number of complications that spring to mind:
> >
> > 1) Some characters decompose to two characters that may otherwise
> > lay claim to their own cells:
> >
> > U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to
> > <06D2, 0654>.  Do you intend that your scheme be usable by
> > Unicode-compliant processes?
> 
> Decompose during which step? During shaping?
> 
> Or do you mean they are NFC-NFD counterparts of each other?

The latter.
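
The equivalence can be checked with Python's unicodedata module.  The
cells() helper below is a hypothetical, deliberately naive cell
counter written for this example (it just gives nonspacing and
enclosing marks no cell of their own); it is not how any particular
terminal allocates cells.

```python
import unicodedata

composed = "\u06D3"  # ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
decomposed = unicodedata.normalize("NFD", composed)  # <06D2, 0654>

def cells(text):
    """Naive cell count: marks (Mn, Me) share the cell of their base."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ("Mn", "Me"))
```

A Unicode-compliant process may hand the terminal either form, so the
two spellings have to end up in the same number of cells; with this
counter both occupy one.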

> > 4) Indic conjuncts.
> > (i) There are some conjuncts, such as Devanagari K.SSA, where a
> > display as <KA, VIRAMA>, <SSA> is simply unacceptable.  In some
> > closely related scripts, this conjunct has the status of a
> > character.  
> 
> We (in GNOME Terminal / VTE) do have an open bug about Devanagari
> spacing marks (currently they don't show up properly), plus Virama and
> friends. I'd like to address the essentials along with the BiDi
> implementation; although here we should discuss the design and not a
> particular implementation thereof :)
> 
> In case you're interested, at
> https://bugzilla.gnome.org/show_bug.cgi?id=584160 comments 45-48, 95
> and perhaps a few other comments I wondered whether certain joining
> operations should be done on the emulation layer or the display layer.
> The answer is not yet clear. We can't suddenly fix everything, but
> it's nice to move forward step by step. It's also been proposed that
> we use HarfBuzz, but it's unclear to me at this point how the grid
> alignment could be preserved in the meantime.

Thanks for the link.

There are two different beasties.  There are text windows into which
the user and the application communicate using text, and this text
tends to be rendered properly, as one might aim to do with HarfBuzz,
and as an Emacs text buffer running the shell tries to do.  (Emacs
needs a lot of help - I can't write a generic Tai Tham OpenType .flt
file :-( )  In my opinion, these are highly appropriate for
applications like diff, grep and cat.  Do we have a good name for
them?  They are, perhaps, 'teletype emulators'.

> "simply unacceptable" – I'm not familiar with those languages,
> cultures and so on, but I'd be hesitant to go as far as calling
> anything "unacceptable". E.g. there's a physical typewriter in our
> family, as far as I remember it has no digits 1 or 0 (use the letters
> lowercase L and anycase O instead), it doesn't contain all the
> accented letters of my mother tongue so sometimes a similar-looking
> one has to be used. In today's computer world, I'd say such
> limitations are "unacceptable", but at that time this was what we had
> to live with.
> 
> Terminal emulators, due to their strict character grid nature and
> their legacy behavior of many decades, are a platform where a certain
> level of compromise might be necessary for some scripts. I cannot tell
> where to draw the line, cannot tell what is "extremely bad" vs. "not
> nice" vs. "kind of okay but could be better", but we can't do
> everything in a terminal emulator that a graphical app could do. If
> someone wants to have a pixel perfect look, terminal emulators are not
> for them. Maybe looking at typewriters of those scripts could be a
> good starting point. Anyway, we've drifted quite far away.

But it is an issue that needs to be addressed.  As a terminal can be
addressed by cell, an application may need to keep track of what text
went into each cell. Misery results when the application gets it wrong.

> What I've already implemented in VTE (in a work-in-progress branch),
> and to my eyes looks quite nice, is Arabic shaping using presentation
> form characters as done by FriBidi (in implicit mode only). According
> to the API of this library, this shaping process keeps a 1:1 mapping
> between the original and shaped letters (at least the number of
> Unicode codepoints – I haven't double checked their terminal width,
> but I really hope they don't mess with us here). That is, I don't have
> to deal with a character cell splitting into two, or two character
> cells joining into one during shaping. Does this sound okay so far?

No.  How many cells do CJK ideographs occupy?  We've had a strong hint
that a medial BEH should occupy one cell, while an isolated BEH should
occupy two.
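
A small Python sketch of why a 1:1 mapping of code points says nothing
about cells.  The function names and the 'overrides' parameter are
hypothetical, invented for this example; the width rule for CJK uses
the East Asian Width property, and the BEH widths in the usage note
merely encode the hint mentioned above, not any agreed rule.

```python
import unicodedata

def cell_width(ch, overrides=None):
    """Cells taken by one character (a sketch, not a full wcwidth)."""
    if overrides and ch in overrides:
        return overrides[ch]          # per-glyph width, e.g. by form
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0                      # marks sit in the base's cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # wide and fullwidth characters
    return 1

def line_width(text, overrides=None):
    """Total cells for a string: the sum of the per-character widths."""
    return sum(cell_width(ch, overrides) for ch in text)
```

Two CJK ideographs are two code points but four cells.  If isolated
BEH (U+FE8F) were given two cells via overrides while medial BEH
(U+FE92) kept one, shaping would change the cell count of a line even
though the number of code points stayed the same.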

Richard.