Proposal for BiDi in terminal emulators

Richard Wordingham via Unicode unicode at unicode.org
Fri Feb 1 22:01:42 CST 2019


On Fri, 1 Feb 2019 15:15:53 +0100
Egmont Koblinger via Unicode <unicode at unicode.org> wrote:

> Hi Richard,
> 
> On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:
> 
> > Cropped why?  If the problem is the truncation of lines, one can
> > simply store the next character.
> 
> Yup, truncation of lines, for example.
> 
> I agree that one could "store the next character". We could extend the
> terminal emulation protocol where by some means you can specify that
> column 80 contains a letter X, and even though there's no column 81,
> an app can still tell the terminal emulator that it should imagine
> that column 81 contains the letter Y, and perform shaping accordingly.
> 
> This will need to be done not just at the end of the terminal, but at
> any position, and for both directions. Think of e.g. a vertically
> split tmux. You should be able to tell that column 40 contains X which
> should be shaped as if column 41 contained Y, and column 41 contains Z
> which should be shaped as if column 40 contained A.
> 
> What I cannot see at all is how this could be done "simply". Could you
> please elaborate on that? I don't find this simple at all!
> 
> > > It's not able to
> > > separate different UI elements that happen to be adjacent in the
> > > terminal, separated by different background color or such.
> >
> > ZWJ and ZWNJ can handle that.  
> 
> Wouldn't it be a semantic misuse of these characters, though?

No.  ZWNJ is used before the inanimate plural suffix of Persian, and in
at least one language, <HEH, ZWJ> is used to distinguish one usage from
the digit ٥ (or is it the digit ۵?).
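
To make the point concrete, here is a minimal sketch in Python. The Persian
word is an assumed example (it does not come from this thread); the sketch
only shows that the joiner controls are ordinary format characters sitting
in the logical text alongside the letters.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Persian "ketab-ha" (books): the noun, then ZWNJ, then the plural suffix.
# Illustrative word chosen for this sketch, not taken from the discussion.
word = "\u06a9\u062a\u0627\u0628" + ZWNJ + "\u0647\u0627"

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
            for c in text]

for row in describe(word):
    print(*row)
```

Both joiners report general category Cf (format), so nothing in the
character data marks their use here as out of the ordinary.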

> They are supposed to be present in the logical order, and in logical
> order (that is: the terminal's implicit mode) they can work as
> desired.
> 
> Are they okay to be present in visual order (the terminal's explicit
> mode, what we're discussing now) too?

Where do you define the order for explicit mode?

There may be complications in ensuring that
<joiner control><letter><non-spacing marks><joiner control> gets stored
as the content of a single cell.
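
A rough sketch of where the complication lies, in Python. The function
`split_cells` and its attachment policy are hypothetical, not part of any
terminal or of the draft: a joiner directly after a base letter (and its
marks) is taken as closing that cell, and any other joiner is taken as
opening the next one. A real scheme would have to pin down exactly this
choice.

```python
import unicodedata

JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ

def split_cells(text):
    """Group code points into cells of the form
    [joiner] base [non-spacing marks] [joiner].

    Assumed policy: a joiner right after a base (or its marks) closes the
    current cell; a joiner anywhere else leads the next cell. A mark with
    no base before it is treated as a base of its own.
    """
    cells = []
    current = ""       # cell being built
    has_base = False   # current cell already holds its base character
    closed = False     # current cell already took its trailing joiner
    for ch in text:
        is_joiner = ch in JOINERS
        is_mark = unicodedata.category(ch) == "Mn"
        if has_base and not closed and is_mark:
            current += ch
        elif has_base and not closed and is_joiner:
            current += ch
            closed = True
        else:
            if has_base:
                # The current cell is complete; start a fresh one.
                cells.append(current)
                current, has_base, closed = "", False, False
            current += ch
            if not is_joiner:
                has_base = True
    if current:
        cells.append(current)
    return cells

# ZWJ, BEH, FATHA, ZWJ, ZWJ, BEH: two cells, each carrying its own joiners.
sample = "\u200d\u0628\u064e\u200d\u200d\u0628"
print([[f"U+{ord(c):04X}" for c in cell] for cell in split_cells(sample)])
```

With a single joiner between two letters, this policy always gives it to
the left-hand cell; whether that is what the application meant is the open
question.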

> 
> Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> above.

Example, please.

> 
> > If a general text manipulating application, e.g. cat, grep or awk,
> > is writing to a file, it should not convert normal Arabic
> > characters to presentation forms.  You are now asking a general
> > application to determine whether it is writing to a terminal or
> > not, and alter its output if it is writing to a terminal.  
> 
> No, this is absolutely not what I'm talking about!
> 
> There are two vastly different modes of the terminal. For "cat",
> "grep" etc. the terminal will be in implicit mode. Absolutely no BiDi
> handling is expected from these apps, the terminal will do BiDi and
> shaping (perhaps using Harfbuzz; perhaps using presentation form
> characters as a temporary low-hanging fruit until a better one is
> implemented – the choice is obviously up to the implementation and not
> to the specification).
> 
> For "emacs" and friends, an explicit mode is required where visual
> order is passed to the terminal. What we're discussing is how to
> handle shaping in this mode.

(Partitioning grapheme clusters and Indic syllables)
> > But it is an issue that needs to be addressed.  As a terminal can be
> > addressed by cell, an application may need to keep track of what
> > text went into each cell. Misery results when the application gets
> > it wrong.  
> 
> My recommendation doesn't change this principle at all. In the lower
> (emulation) layer every character still goes into the cell it used to
> go to, and is addressable using cursor motion escapes and so on
> exactly as without BiDi.

At present, VTE positions LTR Indic preceding spacing combining marks
after the consonant.  I thought your draft scheme corrected this very
local bidi issue, which is so local that the bidi algorithm ignores it.
 
> 
> 
> > How many cells do CJK ideographs occupy?  We've had a strong hint
> > that a medial BEH should occupy one cell, while an isolated BEH
> > should occupy two.  
> 
> CJK occupy two, but they do regardless of what's around them. That is,
> they already occupy two cells in the logical buffers, in the emulation
> layer.
> 
> There is absolutely no sane way we can make in terminal emulation a
> character's logical width (as in number of cells it occupies) depend
> on its neighboring characters. (And even if we could by some terrible
> hacks, it would break the principle you just said as "misery
> results...", and the principle Eli said that things should remain
> reasonably simple, otherwise hardly anyone will bother implementing
> them.) This is a compromise Arabic folks will have to accept.

So ព្រះ <U+1796 KHMER LETTER PO, U+17D2 KHMER SIGN COENG, U+179A KHMER
LETTER RO, U+17C8 KHMER SIGN YUUKALEAPINTU> _preah_ 'prefix denoting
respect for gods, kings, etc.' will be three cells <្រ,ព,ៈ> = <(COENG,
RA), PO, YUUKALEAPINTU> and cause no confusion?  Or will the cells be
<RA, (PO, COENG), YUUKALEAPINTU>?
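
For reference, a sketch of what a simple per-character rule does with this
sequence, in Python. The rule in `naive_cells` (non-spacing and enclosing
marks join the previous cell, everything else starts a new cell) is an
assumed model for illustration, not the behaviour of VTE or of any
particular terminal.

```python
import unicodedata

# The sequence as given above: PO, COENG, RO, U+17C8.
preah = "\u1796\u17d2\u179a\u17c8"

def naive_cells(text):
    """Assign code points to cells with a per-character rule: marks of
    category Mn or Me join the previous cell, anything else opens a new
    cell. Assumed model only."""
    cells = []
    for ch in text:
        if cells and unicodedata.category(ch) in ("Mn", "Me"):
            cells[-1].append(ch)
        else:
            cells.append([ch])
    return cells

for cell in naive_cells(preah):
    print([unicodedata.name(c) for c in cell])
```

Under this rule COENG (category Mn) lands in the cell of PO, away from the
RO it subjoins, while RO and the spacing mark U+17C8 (category Mc) each
get a cell of their own.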

Richard.


