Proposal for BiDi in terminal emulators

Egmont Koblinger via Unicode unicode at unicode.org
Sat Feb 2 05:54:16 CST 2019


Hi Richard,

> > Are they okay to be present in visual order (the terminal's explicit
> > mode, what we're discussing now) too?
>
> Where do you define the order for explicit mode?

In explicit mode, the application (Emacs, Vim, whatever) reorders the
characters, and passes visual order (left to right) to the terminal
emulator. The terminal emulator preserves this visual order, doesn't
reshuffle anything.

How should ZW(N)J be handled in visual order? What's the desired way?
Is it specified anywhere? As far as I know, they specify the relation
between two adjacent characters of the logical order, which might not
even end up adjacent in the visual order. Should they always "stick"
to the preceding character, for example?

The Unicode BiDi algorithm doesn't seem to distinguish between base
letters and combining accents when reordering. So, given a base letter
+ a combining accent in RTL text, the visual LTR order produced by the
BiDi algorithm has the combining accent first (on the left), followed
by the base letter. This order is not okay for terminal emulators.
Combining accents have to be reordered in the output of the Unicode
BiDi algorithm, so that they come after the base letter even in the
visual LTR order. This is e.g. what FriBidi does by default, due to
the REORDER_NSM flag.
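
To make the reordering concrete, here is a minimal Python sketch of
the idea. It is not FriBidi's implementation: the function names are
made up for this example, and the "visual order" is a plain reversal
of a pure-RTL run rather than the output of the full algorithm.

```python
import unicodedata

def is_nsm(ch):
    # Combining marks carry the bidi class NSM in the Unicode data.
    return unicodedata.bidirectional(ch) == "NSM"

def naive_visual(rtl_run):
    # Assumption: the run is purely RTL, so visual LTR order is a reversal.
    return rtl_run[::-1]

def reorder_nsm(visual):
    # After reversal, marks sit to the left of their base letter.
    # Move each run of marks so that it follows its base again.
    out = []
    pending = []
    for ch in visual:
        if is_nsm(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(reversed(pending))  # restore the logical mark order
            pending.clear()
    out.extend(reversed(pending))  # stray marks that had no base
    return "".join(out)

# Hebrew ALEF + QAMATS (combining) + BET, in logical order.
logical = "\u05D0\u05B8\u05D1"
visual = reorder_nsm(naive_visual(logical))  # BET, ALEF, QAMATS
```

Note that Python's unicodedata reports ZWJ and ZWNJ as bidi class BN
rather than NSM, so this particular sketch leaves them wherever the
reversal put them.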

Presumably it doesn't just reorder non-spacing combining accents, but
also ZW(N)J and similar symbols, which already smells pretty
problematic, doesn't it? Or is this what you need there, too?

> There may be complications in ensuring that
> <joiner control><letter><non-spacing marks><joiner control> gets stored
> as the content of a single cell.

How should the terminal emulator know which cell (the previous or the
subsequent) these two <joiner control>s belong to?

> > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> > above.
>
> Example, please.

Cropped strings, cropped strings that are adjacent to each other, and
faulty shaping could kick in there.

Two fields on the UI. One in columns 36-40 with cyan background,
aiming to show ABCDEF, but due to limited room, can only show ABCDE
(let's say it's scrolled horizontally this way). Another in columns
41-45 with yellow background, aiming to show UVWXYZ, but due to
limited space only VWXYZ is shown (it's scrolled horizontally like
this).

What the terminal emulator sees is a continuous text of ABCDEVWXYZ.
What the application wants is for E to be shaped as if there were an F
on its right, and for V to be shaped as if there were a U on its left.
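
The same example as a small Python sketch, just to show which piece of
information gets lost. The helper names are hypothetical, and the
Latin letters are placeholders for letters of a script that needs
shaping.

```python
def visible(text, scroll, width):
    # The part of a field that fits its columns after horizontal scrolling.
    return text[scroll:scroll + width]

def hidden_neighbors(text, scroll, width):
    # Characters just outside the visible window (None at a real string edge).
    left = text[scroll - 1] if scroll > 0 else None
    end = scroll + width
    right = text[end] if end < len(text) else None
    return left, right

# (full text, scroll offset, width in columns) for the two fields.
fields = [("ABCDEF", 0, 5), ("UVWXYZ", 1, 5)]

# What the terminal emulator receives: one continuous run of cells.
row = "".join(visible(*f) for f in fields)

# What only the application knows: the cropped-off neighbor of each field.
context = [hidden_neighbors(*f) for f in fields]
```

Here row is "ABCDEVWXYZ", while context holds the F and the U that
ought to influence the shaping of E and V but never reach the
terminal.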

Once you address this problem, I'm not sure ZW(N)J are still
required/desirable, rather than applying this more generic solution
there as well.

> At present, VTE positions LTR Indic preceding spacing combining marks
> after the consonant.  I thought your draft scheme corrected this very
> local bidi issue, which is so local that the bidi algorithm ignores it.

Indic spacing combining marks are handled incorrectly by VTE and are
being addressed in bug 584160, which I've already linked. I don't
consider this particular issue to be BiDi at all. It's something
totally different. The spacing accent can be fully to the right;
somewhat on top and somewhat to the right; on top; somewhat to the
left and somewhat on top; or fully on the left. It's not a binary left
or right. Proper rendering should be done by the font, and not at all
by the BiDi of the terminal. The terminal is unaware of how much the
base glyph is shifted to the right and the accent to its left. All
that the terminal needs to do (and VTE gets it wrong now) is to pass
these two to whichever font rendering engine it uses in one single
step.

> So ព្រះ <U+1796 KHMER LETTER PO, U+17D2 KHMER SIGN COENG, U+179A KHMER
> LETTER RO, U+17C8 KHMER SIGN YUUKALEAPINTU> _preah_ 'prefix denoting
> respect for gods, kings, etc.' will be three cells <្រ,ព,ៈ> = <(COENG,
> RA), PO, YUUKALEAPINTU> and cause no confusion?  Or will the cells be
> <RA, (PO, COENG), YUUKALEAPINTU>?

The first two are a base character followed by a non-spacing mark. As
in most terminal emulators (and now we're absolutely not talking about
my BiDi proposal), they are stored in the same cell. The first cell
contains (PO, COENG).

The next two are a base character followed by a spacing mark. In VTE
584160 I outline two possible approaches, but the one I'm in favor of
is that the row's second cell contains RO and the third cell contains
YUUKALEAPINTU, and the two are combined properly when the logical
contents get displayed. Another possibility I'm pondering is whether
the emulation layer should combine them, that is, have the second
cell store the "first half of (RO, YUUKA)" and the third cell store
the "second half of (RO, YUUKA)".
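
As a rough Python sketch of the first approach (the function name is
made up, and deciding by general category instead of by a wcwidth()
style width lookup is a simplifying assumption, not what VTE does):

```python
import unicodedata

def split_into_cells(text):
    # Non-spacing (Mn) and enclosing (Me) marks join the preceding cell;
    # everything else, spacing marks (Mc) included, starts a new cell.
    cells = []
    for ch in text:
        if cells and unicodedata.category(ch) in ("Mn", "Me"):
            cells[-1] += ch
        else:
            cells.append(ch)
    return cells

# PO, COENG, RO, YUUKALEAPINTU in logical order.
word = "\u1796\u17D2\u179A\u17C8"
cells = split_into_cells(word)
```

This gives three cells: (PO, COENG), RO and YUUKALEAPINTU. The
renderer would then still have to hand the second and third cells to
the font engine together.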

Does this make any sense? If not, could you please explain what the
desired behavior is and why? Please keep in mind that I know nothing
about Khmer in particular.

Anyway, here we're talking about something that's totally independent
from my BiDi work. It's also something that should be standardized
across terminals, sure, but maybe not right now :)


cheers,
egmont


