Proposal for BiDi in terminal emulators

Egmont Koblinger via Unicode unicode at unicode.org
Thu Jan 31 05:46:48 CST 2019


Hi Richard,

> Basic Arabic shaping, at the level of a typewriter, is straightforward
> enough to leave to a terminal emulator, as Eli has suggested.

What is "basic" Arabic shaping exactly?

I can see problems with leaving it to a terminal. It's not aware of
the neighboring character if the string is cropped. It's not able to
separate different UI elements that happen to be adjacent in the
terminal, separated by different background color or such.

On the other hand, let's reverse the question:

"Basic Arabic shaping, at the level of a typewriter, is
straightforward enough to be implemented in the application, using
presentation form characters, as I suggest". Could you please point
out the problems with this statement?

> I believe combining marks present issues even in implicit modes.  In
> implicit mode, one cannot simply delegate the task to normal text
> rendering, for one has to allocate text to cells.  There are a number
> of complications that spring to mind:
>
> 1) Some characters decompose to two characters that may otherwise lay
> claim to their own cells:
>
> U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
> 0654>.  Do you intend that your scheme be usable by Unicode-compliant
> processes?

Decompose during which step? During shaping?

Or do you mean they are NFC-NFD counterparts of each other?

Most terminal emulators are able to handle combining accents, and of
course implicit mode would take them into account when rearranging the
letters. Terminal emulators don't do explicit (de)composing, a.k.a.
NFC->NFD or NFD->NFC conversion (at least I'm not aware of any that
does).

> 4) Indic conjuncts.
> (i) There are some conjuncts, such as Devanagari K.SSA, where a
> display as <KA, VIRAMA>, <SSA> is simply unacceptable.  In some
> closely related scripts, this conjunct has the status of a character.

We (in GNOME Terminal / VTE) do have an open bug about Devanagari
spacing marks (currently they don't show up properly), plus Virama and
friends. I'd like to address the essentials along with the BiDi
implementation; although here we should discuss the design and not a
particular implementation thereof :)

In case you're interested, at
https://bugzilla.gnome.org/show_bug.cgi?id=584160 comments 45-48, 95
and perhaps a few others comments I wondered whether certain joining
operations should be done on the emulation layer or the display layer.
The answer is not yet clear. We can't fix suddenly everything, but
it's nice to move forward step by step. It's also proposed that we
used HarfBuzz, but it's unclear to me at this point how the grid
alignment could be preserved in the mean time.

"simply unacceptable" – I'm not familiar with those languages,
cultures and so on, but I'd be hesitant to go as far as calling
anything "unacceptable". E.g. there's a physical typewriter in our
family, as far as I remember it has no digits 1 or 0 (use the letters
lowercase L and anycase O instead), it doesn't contain all the
accented letters of my mother tounge so sometimes a similarly looking
one has to be used. In today's computer world, I'd say such
limitations are "unacceptable", but at that time this was what we had
to live with.

Terminal emulators, due to their strict character grid nature and
their legacy behavior of many decades, are a platform where a certain
level of compromise might be necessary for some scripts. I cannot tell
where to draw the line, cannot tell what is "extremely bad" vs. "not
nice" vs. "kind of okay but could be better", but we can't do
everything in a terminal emulator that a graphical app could do. If
someone wants to have a pixel perfect look, terminal emulators are not
for them. Maybe looking at typewriters of those scripts could be a
good starting point. Anyway, we've drifted quite far away.


What I've already implemented in VTE (in a work-in-progress branch),
and to my eyes looks quite nice, is Arabic shape using presentation
form characters as done by FriBidi (in implicit mode only). According
to the API of this library, this shaping process keeps a 1:1 mapping
between the original and shaped letters (at least the number of
Unicode codepoints – I haven't double checked their terminal width,
but I really hope they don't mess with us here). That is, I don't have
to deal with a character cell splitting into two, or two character
cells joining into one during shaping. Does this sound okay so far?


cheers,
egmont



More information about the Unicode mailing list