Proposal for BiDi in terminal emulators

Richard Wordingham via Unicode unicode at unicode.org
Sat Feb 2 10:35:13 CST 2019


On Sat, 2 Feb 2019 13:18:03 +0100
Egmont Koblinger via Unicode <unicode at unicode.org> wrote:

> Hi Richard,
> 
> On Sat, Feb 2, 2019 at 12:43 PM Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:
> 
> > I'm not conversant with the details of terminal controls and I
> > haven't used fields.  However, where I spoke of lines above, I
> > believe you can simply translate it to fields.  I don't know how
> > one best handles fields - are they a list, possibly of rows within
> > fields, or are they stored as cell attributes?  
> 
> The very essential is that the terminal emulator stores "cells".
> Pretty much all the data (with very few exceptions) resides in cells.
> 
> A cell contains a base letter, followed by possibly a few non-spacing
> marks. A cell has a foreground color, background color, bold,
> underlined, italic etc. properties.
> 
> How these cells are linked up, in an array or whatever, is mostly
> irrelevant since it's likely to be different in every implementation.
> 
> Of course it is possible to extend the per-cell storage to contain a
> "previous" and a "next" character, as to be used for shaping purposes
> only. Some questions: Is this enough (e.g. aren't there cases where
> more than the immediate neighbor are relevant)? Is the next base
> character enough, or do we also need to know the combining accents
> that belong to that? And can't we store significantly less information
> than the actual letter (let's say, 1 out of 13 [randomly made up
> number] possible ways of shaping)?

Truncation at the start of the string gives us the clearest nasty.  If
you look at TUS Figure 13-7, you'll find that the final U+182D in
ᠵᠠᠷᠯᠢᠭ <U+1835, U+1820, U+1837, U+182F, U+1822, U+182D> _jarlig_
'order' and ᠴᠢᠷᠢᠭ <U+1834, U+1822, U+1837, U+1822, U+182D> _chirig_
'soldier' should be different because the former word has a masculine
vowel, namely U+1820, and the latter doesn't.  When written
horizontally, the Mongolian script is left-to-right, i.e. upside down
compared to its
Aramaic ancestor.  What we need to note is the preceding
'gender'-determining vowel.
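
As a rough illustration of why truncation hurts here, a minimal Python
sketch follows.  The function name is made up, and the set of masculine
vowels (A, O, U) is my assumption for the example rather than a complete
classification; a real shaper would follow the font's and the standard's
rules.

```python
# Assumed, deliberately incomplete classification, for illustration only.
MASCULINE_VOWELS = {0x1820, 0x1823, 0x1824}  # MONGOLIAN LETTER A, O, U


def has_masculine_vowel_before(word: str, index: int) -> bool:
    """Return True if any assumed-masculine vowel precedes word[index]."""
    return any(ord(ch) in MASCULINE_VOWELS for ch in word[:index])


jarlig = "\u1835\u1820\u1837\u182F\u1822\u182D"
chirig = "\u1834\u1822\u1837\u1822\u182D"

# With the whole word available, the final U+182D can be told apart.
full_jarlig = has_masculine_vowel_before(jarlig, len(jarlig) - 1)
full_chirig = has_masculine_vowel_before(chirig, len(chirig) - 1)

# If only the last two letters survive truncation, the evidence is gone,
# so the fact has to be recorded alongside the surviving cells.
tail = jarlig[-2:]
truncated_jarlig = has_masculine_vowel_before(tail, len(tail) - 1)
```

With the full words the two results differ; with only the tail of
_jarlig_ the masculine vowel is no longer visible, which is the
information a cell would have to carry.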

There are analogues of THAI CHARACTER SARA AM in the Tai Tham script -
<U+1A63 TAI THAM VOWEL SIGN AA, U+1A74 TAI THAM SIGN MAI KANG> and
<U+1A64 TAI THAM VOWEL SIGN TALL AA, U+1A74>.  In all the examples of
the latter I've seen, U+1A74 is placed over the preceding consonant, so
if U+1A64 is lost through lack of space, the U+1A74 should still
remain.  The former is a matter of style.  Outside Thailand, the mark
above is clearly associated (with one exception) with the U+1A74, so
both can safely vanish together.  In Thailand, the U+1A74 can be
associated with the consonant instead, or hover over the gap between
consonant and vowel.

The exception is the ligature <U+1A36 TAI THAM CONSONANT NA, U+1A63>.
That should really only get one cell.  The combination ᨶ᩶ᩣᩴ <U+1A36,
U+1A76 TAI THAM SIGN TONE-2, U+1A63, U+1A74> 'water, fluid' looks like
<NAA, U+1A74, U+1A76>.

There are then some interesting Indic phenomena depending on how one
treats subscript consonants.  The coding structure <Lo
consonant><stacker><Lo consonant><preposed mark> is widespread.

As a lesser form of this, in Khmer <consonant><U+17D2 KHMER SIGN
COENG><consonant><U+17B6 KHMER VOWEL SIGN AA> the first consonant and
U+17B6 ligate, and the ligation is highly visible on that
consonant even if the vowel is covered up.  If the display were to
chop off the second consonant, all that need be remembered is the
following vowel.

There is also the repha and analogues.  Repha is graphically a
superscript mark, but is usually encoded as <RA, VIRAMA>.  Burmese
kinzi is similar, but has a 3-character code.  They really ought to be
associated with the same cell as the immediately following consonant. 

The good news is that the record of the relevant neighbour can be
compressed to a few bits.
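
To make "a few bits" concrete, here is a minimal Python sketch.  The
type, the flag names and the choice of exactly these two flags are all
hypothetical; they only mirror the Mongolian and Khmer cases above and
are not a proposed storage format.

```python
from enum import IntFlag


class NeighbourContext(IntFlag):
    """Hypothetical per-cell record of shaping-relevant neighbours."""
    NONE = 0
    MASCULINE_VOWEL_BEFORE = 1  # Mongolian: masculine vowel earlier in the word
    VOWEL_AA_AFTER = 2          # Khmer: U+17B6 follows a chopped-off consonant


def context_bits(masculine_before: bool, aa_after: bool) -> int:
    """Pack the two observations into a small integer for a cell."""
    ctx = NeighbourContext.NONE
    if masculine_before:
        ctx |= NeighbourContext.MASCULINE_VOWEL_BEFORE
    if aa_after:
        ctx |= NeighbourContext.VOWEL_AA_AFTER
    return int(ctx)
```

A cell would store the returned integer next to its text and
attributes, and the renderer would test individual flags when choosing
a glyph variant.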

Richard.


