Proposal for BiDi in terminal emulators

Wed Jan 30 16:02:48 CST 2019

On Wed, 30 Jan 2019 15:33:38 +0100
Frédéric Grosshans via Unicode <unicode at unicode.org> wrote:

> Le 30/01/2019 à 14:36, Egmont Koblinger via Unicode a écrit :
> > - It doesn't do Arabic shaping. In my recommendation I'm arguing
> > that in this mode, where shuffling the characters is the task of
> > the text editor and not the terminal, so should it be for Arabic
> > shaping using presentation form characters.  
> 
> I guess Arabic shaping is doable through presentation form
> characters, because the latter are character inherited from legacy
> standards using them in such solutions.

So long as you don't care about local variants, e.g. U+0763 ARABIC
LETTER KEHEH WITH THREE DOTS ABOVE.  It has no presentation form
characters.

Basic Arabic shaping, at the level of a typewriter, is straightforward
enough to leave to a terminal emulator, as Eli has suggested.  Lam-alif
would be trickier - one cell or two?

> But if you want to support
> other “arabic like” scripts (like Syriac, N’ko), or even some LTR
> complex scripts, like Myanmar or Khmer, this “solution” cannot work,
> because no equivalent of “presentation form characters” exists for
> these scripts

I believe combining marks present issues even in implicit modes.  In
implicit mode, one cannot simply delegate the task to normal text
rendering, for one has to allocate text to cells.  There are a number
of complications that spring to mind:

1) Some characters decompose to two characters that may otherwise lay
claim to their own cells:

U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
0654>.  Do you intend that your scheme be usable by Unicode-compliant
processes?

2) 2-part vowels, such as U+0D4A MALAYALAM VOWEL SIGN O, which
canonically decomposes into a preceding combining mark U+0D46 MALAYALAM
VOWEL SIGN E and following combining mark U+0D3E MALAYALAM VOWEL SIGN
AA.

3) Similar 2-part vowels that do not decompose, such as U+17C4 KHMER
VOWEL SIGN OO.  OpenType layout decomposes that into a preceding
'U+17C1 KHMER VOWEL SIGN E' and the second part.

4) Indic conjuncts.
(i) There are some conjuncts, such as Devanagari K.SSA, where a
display as <KA, VIRAMA>, <SSA> is simply unacceptable.  In some
closely related scripts, this conjunct has the status of a character.

(ii) In some scripts, e.g. Khmer, the virama-equivalent is not an
acceptable alternative to form a consonant stack.  Khmer could
equally well have been encoded with a set of subscript consonants in
the same manner as Tibetan.

(iii) In some scripts, there are marks named as medial consonants
which function in exactly the same way as <'virama', consonant>; it is
silly to render them in entirely different manners.

5) Some non-spacing marks are spacing marks in some contexts.  U+102F
MYANMAR VOWEL SIGN U is probably the best known example.

Richard.