Proposal for BiDi in terminal emulators

Richard Wordingham via Unicode unicode at unicode.org
Sat Feb 2 14:57:01 CST 2019


On Sat, 2 Feb 2019 12:54:16 +0100
Egmont Koblinger via Unicode <unicode at unicode.org> wrote:

> Hi Richard,
> 
> > > Are they okay to be present in visual order (the terminal's
> > > explicit mode, what we're discussing now) too?  
> >
> > Where do you define the order for explicit mode?  
> 
> In explicit mode, the application (Emacs, Vim, whatever) reorders the
> characters, and passes visual order (left to right) to the terminal
> emulator. The terminal emulator preserves this visual order, doesn't
> reshuffle anything.

Seriously, you need to give a definition of 'visual order' for this
context.  Not everyone shares your chiralist view.

> How to handle ZW(N)J in visual order? What's the desired way? Is it
> specified anywhere? As far as I know, they specify the relation
> between two adjacent characters of the logical order, which might not
> even become adjacent in the visual. Should they always "stick" to the
> preceding character, for example?

> The Unicode BiDi algorithm doesn't seem to make a difference between
> base letters and combining accents for reordering. So, given in an RTL
> text a base letter + a combining accent, the BiDi algorithm gives the
> visual LTR order of the combining accent first (on the left), followed
> by the base letter. This order is not okay for terminal emulators.
> Combining accents have to be reordered in the output of the Unicode
> BiDi algorithm, so that they come after the base letter even in the
> visual LTR order. This is e.g. what FriBidi does by default, due to
> the REORDER_NSM flag.

> Presumably it doesn't just reorder non-spacing combining accents, but
> also ZW(N)J and similar symbols too, which already smells pretty
> problematic, doesn't it? Or is this what you need there, too?
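
The quoted point about combining accents can be made concrete with a small sketch. It is not FriBidi's implementation; it only contrasts a plain reversal of a right-to-left run with a reversal that keeps non-spacing marks after their base. The function names are hypothetical, and "mark" is assumed to mean general category Mn.

```python
import unicodedata


def is_nonspacing_mark(ch: str) -> bool:
    # Assumption: "combining accent" means general category Mn.
    return unicodedata.category(ch) == "Mn"


def clusters(logical: str) -> list[str]:
    """Group each character with the non-spacing marks that follow it."""
    out: list[str] = []
    for ch in logical:
        if out and is_nonspacing_mark(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def visual_plain(logical_rtl_run: str) -> str:
    """Plain reversal: marks end up to the left of (before) their base."""
    return logical_rtl_run[::-1]


def visual_marks_after_base(logical_rtl_run: str) -> str:
    """Reverse cluster by cluster, so marks still follow their base."""
    return "".join(reversed(clusters(logical_rtl_run)))
```

ZWJ and ZWNJ have general category Cf, not Mn, so this sketch does not attach them to anything: a ZWJ simply starts a group of its own, and a following mark then attaches to the ZWJ rather than to the letter. That is the open question raised in the quoted text.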

Even for logically ordered text, the positioning of the joiners is not
spelt out.  For example, I may have the sequence <NA, stacker, TA, SIGN
AA>, and want to specify the ligating behavior of NA.  I would choose
<NA, stacker, TA, ZWNJ, SIGN AA>, but this wouldn't let me choose
between it ligating with NA or with TA.

What happens when one selects text from the display?  I think this may
affect the choice of text representation for the cells.

For storing an explicit string in unnatural order free of bidi controls,
I would start with the equivalent implicit mode string, reverse it, and
pass that.  I believe the cell contents would then need to be reversed
again for rendering.  A good test case would be <U+05D3 HEBREW LETTER
DALET, U+05B1 HEBREW POINT HATAF SEGOL, ZWJ, U+05BD HEBREW POINT
METEG>; the ZWJ ligates the points, not base consonants.
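
As a rough sketch of that procedure, assuming the whole test sequence ends up in a single cell (the helper names are hypothetical):

```python
ZWJ = "\u200d"
DALET = "\u05d3"        # HEBREW LETTER DALET
HATAF_SEGOL = "\u05b1"  # HEBREW POINT HATAF SEGOL
METEG = "\u05bd"        # HEBREW POINT METEG

# The implicit mode (logical order) string of the test case.
implicit = DALET + HATAF_SEGOL + ZWJ + METEG


def to_explicit(implicit_text: str) -> str:
    """Reverse the implicit mode string code point by code point."""
    return implicit_text[::-1]


def cell_for_rendering(cell_contents: str) -> str:
    """Reverse a cell's stored contents again before shaping them."""
    return cell_contents[::-1]


explicit = to_explicit(implicit)
```

Here `explicit` is METEG, ZWJ, HATAF SEGOL, DALET, so the base letter arrives last; reversing the cell contents restores the logical order the font needs.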

> > There may be complications in ensuring that
> > <joiner control><letter><non-spacing marks><joiner control> gets
> > stored as the content of a single cell.  
> 
> How should the terminal emulator know which cell (the previous or the
> subsequent) do these two <joiner control>s belong to?

I think this has to depend on convention.  One scheme that might work
is, storing the contents in logical order:

<left><right> => <right> ZWJ and ZWJ <left>
<left>ZWJ<right> => <right> ZWJ and ZWJ <left>
<left>ZWNJ<right> => <right> and <left>
<left>ZWJ ZWNJ<right> => <right> and ZWJ <left>
<left>ZWNJ ZWJ<right> => <right> ZWJ and <left>
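
One way to read the five rules is that the joiner nearest each cell decides that cell's behaviour, and that no joiner at all means both cells join. A minimal sketch under that reading (the function name and signature are hypothetical):

```python
ZWJ, ZWNJ = "\u200d", "\u200c"


def split_joiners(left: str, between: str, right: str) -> tuple[str, str]:
    """Return (right_cell, left_cell) contents, each in logical order.

    `between` holds the joiner controls written between <left> and
    <right> in the rules above: "", ZWJ, ZWNJ, or a pair of them.
    """
    if any(ch not in (ZWJ, ZWNJ) for ch in between):
        raise ValueError("only ZWJ and ZWNJ may appear between the cells")
    if between:
        near_left, near_right = between[0], between[-1]
    else:
        # No control at all: both sides join by default.
        near_left = near_right = ZWJ
    right_cell = right + (ZWJ if near_right == ZWJ else "")
    left_cell = (ZWJ if near_left == ZWJ else "") + left
    return right_cell, left_cell
```

For example, `split_joiners("L", ZWJ + ZWNJ, "R")` gives `("R", ZWJ + "L")`, matching the fourth rule.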

It may be better to have left and right connection bits in the cell
attributes instead of characters, and restore ZWJ and ZWNJ when the
text is cut and pasted from the terminal.  Note that storing
presentation forms in the terminal would, nowadays, normally cause cut
and paste to obtain an unfaithful copy of the original text. 
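
A sketch of the attribute-bit alternative, assuming cells are kept in logical order and that a cell joins on both sides unless told otherwise. The `Cell` type and its field names are hypothetical and not taken from any terminal emulator.

```python
from dataclasses import dataclass

ZWJ, ZWNJ = "\u200d", "\u200c"


@dataclass
class Cell:
    text: str
    joins_prev: bool = True  # connects to the logically preceding cell
    joins_next: bool = True  # connects to the logically following cell


def joiner_between(a: Cell, b: Cell) -> str:
    """Controls to restore between logically adjacent cells a and b."""
    if a.joins_next and b.joins_prev:
        return ""            # default joining, nothing to restore
    if a.joins_next:
        return ZWJ + ZWNJ    # a connects, b does not
    if b.joins_prev:
        return ZWNJ + ZWJ    # b connects, a does not
    return ZWNJ              # neither side connects


def copy_text(cells: list[Cell]) -> str:
    """Rebuild the logical-order text for cut and paste."""
    parts: list[str] = []
    for i, cell in enumerate(cells):
        if i:
            parts.append(joiner_between(cells[i - 1], cell))
        parts.append(cell.text)
    return "".join(parts)
```

The round trip is lossy in one respect: an explicit ZWJ between two cells that would join anyway is not restored, because the bits cannot distinguish it from having no control there.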

> > > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> > > above.  
> >
> > Example, please.  
> 
> Cropped strings, cropped strings that are adjacent to each other, and
> faulty shaping could kick in there.
> 
> Two fields on the UI. One in columns 36-40 with cyan background,
> aiming to show ABCDEF, but due to limited room, can only show ABCDE
> (let's say it's scrolled horizontally this way). Another in columns
> 41-45 with yellow background, aiming to show UVWXYZ, but due to
> limited space only VWXYZ is shown (it's scrolled horizontally like
> this).
> 
> What the terminal emulator sees is a continuous text of ABCDEVWXYZ.
> What the application wants to have is to get E shaped as if there was
> an F on its right, and get V shaped as if there was a U on its left.

Task:
So the text it's to show is parts of FEDCBA and ZYXWVU.  They are not
continuous with any other text in the terminal.  The display command
will not affect anything but columns 36 to 45.

Assumptions:
FEDCBA and ZYXWVU are each parts of right-to-left runs.

Solution:
The implicit mode text would be

<ZWNJ>ZYXWV<ZWJ>EDCBA<ZWNJ>

(This assumes that Z, V, E and A could otherwise join with the contents
of other cells.)

So send left-to-right text:

<ZWNJ>ABCDE<ZWJ>VWXYZ<ZWNJ>
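
The same construction as a small sketch, with the Latin letters standing in for right-to-left letters as in the example (the function name is hypothetical):

```python
ZWJ, ZWNJ = "\u200d", "\u200c"


def explicit_for_fields(right_field: str, left_field: str) -> str:
    """Build the left-to-right string for two adjacent cropped fields.

    Both arguments are the visible parts in logical order. The ZWNJs
    stop the outer letters joining with neighbouring cells; the ZWJ
    stands in for the cropped letters at the shared boundary.
    """
    implicit = ZWNJ + right_field + ZWJ + left_field + ZWNJ
    # Reverse the implicit mode text to get the explicit mode text.
    return implicit[::-1]


# ZYXWV is the visible part of ZYXWVU, EDCBA that of FEDCBA.
explicit = explicit_for_fields("ZYXWV", "EDCBA")
```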

> Once you address this problem, I'm not sure ZW(N)J are still
> required/desirable, rather than applying this more generic solution
> there as well.
> 
> > At present, VTE positions LTR Indic preceding spacing combining
> > marks after the consonant.  I thought your draft scheme corrected
> > this very local bidi issue, which is so local that the bidi
> > algorithm ignores it.  
> 
> Indic spacing combining marks are handled incorrectly by VTE and are
> being addressed in bug 584160 which I've already linked. This
> particular issue I don't consider BiDi at all. It's something totally
> different. The spacing accent can be to the right, somewhat on top of
> and somewhat to the right, on top of, somewhat to the left and
> somewhat on top of, or fully on the left. It's not binary left or
> right. Proper rendering should be done by font, and not at all by the
> BiDi of the terminal. The terminal is unaware of how much the base
> glyph is shifted to the right and the accent to its left. All that the
> terminal needs to do (and VTE gets it wrong now) is to pass these two
> into whichever font rendering engine in one single step.

How many cells do consonant plus combining mark get between them?

> > So ព្រះ <U+1796 KHMER LETTER PO, U+17D2 KHMER SIGN COENG, U+179A
> > KHMER LETTER RO, U+17C8 KHMER SIGN YUUKALEAPINTU> _preah_ 'prefix
> > denoting respect for gods, kings, etc.' will be three cells
> > <្រ,ព,ៈ> = <(COENG, RO), PO, YUUKALEAPINTU> and cause no confusion?
> > Or will the cells be <RO, (PO, COENG), YUUKALEAPINTU>?  
> 
> First it's a base character followed by a non-spacing mark. As in most
> terminal emulators (and now we're absolutely not talking about my BiDi
> proposal) they are stored in the same cell. The first cell contains
> (PO, COENG).

> The next two are a base character followed by a spacing mark. In VTE
> 584160 I outline two possible approaches, but the one I'm in favor of,
> is that the row's second cell contains RO and the third cell contains
> YUUKALEAPINTU, which two are combined together properly when the
> logical contents get displayed. Another possibility which I'm
> pondering about is whether the emulation layer should combine them,
> that is, have the second cell store the "first half of (RO, YUUKA)"
> and the third cell store the "second half of (RO, YUUKA)".
> 
> Does this make any sense?

A visible U+17D2 has no rôle in the Khmer writing system.  On
computers, it is a warning that the input of a subscript consonant is
only half done.  There are three units of the writing system in that
word - KHMER LETTER PO, KHMER CONSONANT SIGN COENG RO*, and KHMER SIGN
YUUKALEAPINTU.

*a named sequence
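
A minimal sketch of that grouping, treating COENG plus the following character as one unit. It is not a full Khmer syllable segmenter, and the function name is hypothetical.

```python
COENG = "\u17d2"  # KHMER SIGN COENG


def khmer_units(text: str) -> list[str]:
    """Split text into units, binding COENG to the character after it."""
    units: list[str] = []
    i = 0
    while i < len(text):
        if text[i] == COENG and i + 1 < len(text):
            units.append(text[i:i + 2])  # subscript consonant
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


# PO, COENG, RO, YUUKALEAPINTU, as spelled out in the quoted message.
units = khmer_units("\u1796\u17d2\u179a\u17c8")
```

This yields three units: PO, the sequence COENG RO, and YUUKALEAPINTU. A COENG at the very end of the text stays a unit on its own, which corresponds to the half-done input case.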

> If not, could you please explain what and
> why is the desired behavior?

Why: ព្រះ is the rendering,

What: (a) Cell-by-cell rendering: <្រ,ព,ៈ> with dotted circles removed.
or (b) Cell-by-cell rendering: <ព្រះ,ៈ> with dotted circles removed.

A better scheme would be to render the three or two cells together using
a (sensu lato) monospaced font and display the result for the cells.

> Anyway, here we're talking about something that's totally independent
> from my BiDi work. It's also something that should be standardized
> across terminals, sure, but maybe not right now :)

It relates to the insistence that the number of cells assigned to a
character shall not depend on its context.  With the two-cell solution,
LETTER RO gets no cells - it is stored in the cell claimed by LETTER PO.

Richard.


