Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

Philippe Verdy via Unicode unicode at unicode.org
Wed Feb 6 20:50:27 CST 2019


I read your email. You gave the example of how a typical Unix/Linux
tool shows its usage (e.g. "anycommand --help"): a leading line, then the
syntaxes and tabulated lists of options, each followed by translated help
on the same line.

There are some rules for correct display, including with Bidi:

- Separate paragraphs that need a different default Bidi direction by
double newlines (to force a hard break).
- Use a single newline for continuation lines.
- If technical items are untranslatable, make sure they are at the
beginning of lines and indented by some leading spaces, before translated
ones.
- Avoid breaking lists.
- Try to separate text in natural languages from technical text as much as
possible.
- Be careful about the correct use of leading punctuation (notably for list
items).
- Be consistent about indentation.
- Normalize spaces.
- Don't assume that TAB controls have the same width (ban TABs except at
the beginning of lines).
- In column output, always separate columns with at least two spaces; don't
glue them together as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (less than 72
base characters).
- Don't use any Bidi controls!
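
As an illustration only, several of the rules above can be sketched in
Python. The function names (format_options, format_help) and the sample
options are hypothetical, and the padding assumes the untranslated option
names are plain ASCII, so that counting characters is good enough for them.

```python
def format_options(options, indent=2, gap=2):
    """Lay out (option, description) pairs in two space-padded columns.

    The untranslated option name starts each line after `indent` spaces;
    the translated description follows after at least `gap` spaces.
    No TAB and no Bidi control characters are emitted.
    """
    # Assumption: option names are ASCII, so len() is a usable width.
    width = max(len(name) for name, _ in options)
    lines = []
    for name, description in options:
        # Normalize spaces: collapse whitespace runs (including TABs).
        description = " ".join(description.split())
        lines.append(" " * indent + name.ljust(width) + " " * gap + description)
    return "\n".join(lines)


def format_help(intro, options):
    # A double newline separates the natural-language paragraph from the
    # technical list, so each can get its own default direction.
    return intro.strip() + "\n\n" + format_options(options)
```

For example, format_help("Usage: anycommand [OPTION]", [("--help", "show
this help")]) yields the intro line, an empty line, and then one indented
line per option, with the translated text always in the second column.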

With some care, you can perfectly well translate Linux/Unix tools into
languages needing Bidi and get consistent output, but be careful if your
text contains placeholders or untranslated technical terms: make sure to
surround them with paired punctuation, or don't translate them at all. And
avoid paragraphs that mix natural language with untranslatable technical
terms (such as command names or command-line options).
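
To see why mixing matters, here is a rough sketch. It assumes the display
side picks each paragraph's direction from its first strong character, in
the spirit of rules P2 and P3 of the Unicode Bidirectional Algorithm;
whether a given terminal emulator does this is an assumption, the handling
of isolates is left out, and the function names are made up for the
example.

```python
import unicodedata


def first_strong_direction(paragraph, default="ltr"):
    """Guess a base direction from the first strong character.

    Simplified: characters inside isolates are not skipped here.
    """
    for ch in paragraph:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    # No strong character found (digits, punctuation, spaces only).
    return default


def paragraph_directions(text):
    """Split on double newlines and guess a direction per paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [(first_strong_direction(p), p) for p in paragraphs]
```

Under this assumption, a paragraph that opens with a Latin command name
such as "--help" is classified as "ltr" even when the rest of it is Hebrew
or Arabic, while the same translated text in a paragraph of its own is
classified as "rtl".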

Make sure to test the output so that it also works with variable-width
fonts (don't assume monospaced fonts are used: they do not exist for
various scripts and don't work reliably for Arabic and most Asian scripts,
not even for Chinese or Japanese, although these don't need Bidi support).
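
A small sketch of why "one character, one cell" fails even before fonts
come into play. It uses only Python's unicodedata module; the function name
naive_cell_width is hypothetical, and the estimate is deliberately crude:
it knows nothing about Arabic shaping, complex clusters, or what a given
font or terminal really does.

```python
import unicodedata


def naive_cell_width(text):
    """Crude estimate of the terminal cells a string occupies.

    Combining marks count as 0, East Asian wide/fullwidth characters
    as 2, everything else as 1. Zero-width characters whose combining
    class is 0 are NOT detected by this sketch.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # assumed to stack on the previous base character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width
```

Even this simple estimate disagrees with the number of code points: three
CJK ideographs are counted as six cells, and "e" followed by a combining
acute accent (two code points) as one cell.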

But the difficulty is not really in the terminal emulators. It is in the
source texts given to translators, who don't know the context in which the
text will be used and have no hint about which terms should not be
translated, so the results can become inconsistent. There are many
examples, even in Windows 10, where some command-line tools are completely
unusable with the translated UI: their syntax examples do not even work,
because some terms were randomly and inconsistently translated or confused.
In other cases the tools assumed an LTR-only layout of the output and
monospaced fonts with one character per display cell, or required specific
fonts whose monospaced variants do not contain the needed characters. This
is challenging notably for Asian scripts needing complex clusters if you
made these Latin-based assumptions.


On Wed, Feb 6, 2019 at 22:30, Egmont Koblinger <egmont at gmail.com> wrote:

> Hi Philippe,
>
> Thanks a lot for your input!
>
> Another fundamental difficulty with terminal emulators is: These
> controls (CR, LF...) are control instructions that move the cursor in
> some ways, and then are forgotten. You cannot do BiDi on the
> instructions the terminal receives. You can only do BiDi on the
> result, the contents of the canvas after these instructions are
> executed. Here these controls are either lost, or you have to give a
> specification how exactly they need to be remembered, i.e. converted
> to being part of the canvas's data.
>
> Let's also mention that trying to get apps into using them is quite
> hopeless. The best you can do is design BiDi around what you already
> have, which pretty much means hard vs. soft line endings, and
> hopefully forthcoming semantical marks around shell prompts. (To
> overcomplicate the story, a received LF doesn't convert the line
> ending to hard wrapped in most terminal emulators. In some it does. I
> don't think there's an exact specification anywhere. Maybe the BiDi
> spec needs to create one. Lines are hard wrapped by default, turned to
> soft wrapped when the text gets wrapped at the end of the line, and a
> few random control functions turn them back to hard one, but in most
> terminals, a newline is not such a control function.)
>
> Anyway, please also see my previous email; I hope that clarifies a lot
> for you, too.
>
>
> cheers,
> egmont
>
> On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode
> <unicode at unicode.org> wrote:
> >
> > I think that before making any decision we must decide what we mean
> > by "newlines". There are in fact 3 different functions:
> > - (1) soft line breaks (which are used to enforce a maximum display
> > width between paragraph margins): these are equivalent to breakable and
> > compressible whitespace, and do not change the logical paragraph
> > direction. They don't insert any additional vertical gap between lines,
> > so the logical line-height is preserved and continues uninterrupted. If
> > text justification applies, this whitespace will be entirely collapsed
> > into the end margin, and any text before it will still be justified to
> > match the end margin (until the maximum expansion of other whitespace
> > in the middle and the maximum intercharacter gap are both reached, in
> > which case that line will no longer be expanded). But this does not
> > apply to terminal emulators, which normally never use text
> > justification, so the text will just be aligned to the start margin;
> > whitespace before it on the same line is preserved, and collapsed only
> > at the end of the line (just before the soft line break itself).
> > - (2) hard line breaks: they break to a new line but continue the
> > paragraph within its same logical direction, but they are not
> > compressible whitespace (and do not depend on the logical end margin of
> > the paragraph).
> > - (3) paragraph breaks: generally they introduce an additional vertical
> > gap with top and bottom margins.
> >
> > The problem in terminals is that they usually cannot distinguish types
> > (1) and (2): they are simply encoded by a single CR, or LF, or CR+LF,
> > or NEL. Type (1) exists only within the framework of a higher-level
> > protocol which gives additional interpretation to these "newlines". The
> > special control LS (line separator) is almost never used but may be
> > used for type (1), i.e. soft line breaks, and will fall back to type
> > (2), which is represented by the legacy "simple" newlines (single CR,
> > or single LF, or single CR+LF, or single NEL). I have seen very little
> > or no use of LS.
> >
> > Type (3) may be encoded with PS (paragraph separator), but in terminals
> > (and common protocols like MIME) it is usually encoded using a pair of
> > newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NEL+NEL), possibly with
> > additional whitespace (and additional presentation characters such as
> > ">" in quotations inserted in mail responses) between them (needed for
> > MIME and HTTP), which may be collapsed when rendering or interpreting
> > them.
> >
> > Some terminal protocols can also use other legacy ASCII separators such
> > as FS, GS, RS, US for grouping units containing multiple paragraphs, or
> > STX/EOT pairs for encapsulating whole text documents in a
> > protocol-specific envelope format (and will also use some escaping
> > mechanism for special controls found in the middle, such as DLE+control
> > to escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a
> > DEL, or DEL+x+NN where N are a fixed number of hexadecimal, decimal or
> > octal digits). There's a wide variety of escaping mechanisms used by
> > various higher-layer protocols (including transport protocols or
> > encoding syntaxes used just below the plain-text layer, in a lower
> > layer than the transport protocol layer).
> >
> > On Mon, Feb 4, 2019 at 21:46, Eli Zaretskii via Unicode
> > <unicode at unicode.org> wrote:
> >>
> >> > Date: Mon, 4 Feb 2019 19:45:13 +0000
> >> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> >> >
> >> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> >> > choose how far apart their starting margins are.  I think that could
> >> > get complicated for plain text if the terminal has unbounded width.
> >>
> >> But no real-life terminal does.  The width is always bounded.
>
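
As a footnote to Egmont's quoted point that BiDi can only run on the
contents of the canvas, here is a toy model of rows carrying a soft-wrap
flag. The Canvas class and its methods are invented for illustration; it
treats "\n" as CR+LF and ignores every other control function, so it is a
sketch of the idea and not a description of any real emulator.

```python
class Canvas:
    """Toy terminal canvas: rows of text plus a per-row soft-wrap flag."""

    def __init__(self, width):
        self.width = width
        self.rows = [""]
        self.soft_wrapped = [False]  # rows are hard-ended by default

    def _new_row(self):
        self.rows.append("")
        self.soft_wrapped.append(False)

    def write(self, data):
        for ch in data:
            if ch == "\n":
                # Simplification: handled as CR+LF. The flag of the row
                # being left is not touched by the newline itself.
                self._new_row()
                continue
            if len(self.rows[-1]) == self.width:
                # Text overflowed: the current row becomes soft-wrapped.
                self.soft_wrapped[-1] = True
                self._new_row()
            self.rows[-1] += ch

    def paragraphs(self):
        """Join rows connected by soft wraps into logical paragraphs."""
        result, current = [], ""
        for row, soft in zip(self.rows, self.soft_wrapped):
            current += row
            if not soft:
                result.append(current)
                current = ""
        return result
```

With a width of 5, writing "abcdefg\nhi" leaves three rows on the canvas,
but paragraphs() returns two units, "abcdefg" and "hi": the newline
character is gone, and only the stored flags tell a BiDi pass which rows
belong together.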