Bidi paragraph direction in terminal emulators BiDi in terminal emulators

Egmont Koblinger via Unicode unicode at unicode.org
Wed Feb 6 15:01:59 CST 2019


Hi Eli,

(I'm getting lost as to where to reply, given how the subject gets
mangled and the thread gets split into several.)


I've thought about it a lot, experimented with Emacs's behavior, and
I've arrived at the conclusion that we are actually much closer to
each other than I had thought. Probably there's a lot of
misunderstanding due to different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant
escape sequence), then did a "cat TUTORIAL.he" (the file taken from
Emacs 26.1), and compared the result to what I see in Emacs 25.2.2 –
both the graphical one and the one running in a terminal without BiDi
support.

Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines
start with a BiDi control (I added a preceding character), because
VTE currently doesn't support them: there's no character cell to store
this data in. This definitely needs to be fixed in the second version
of my proposal.

- Emacs running in a terminal shows an underscore wherever there's a
BiDi control in the source file – while the graphical one doesn't.
This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file uses visual indentation,
and Emacs detects LTR paragraph for that line. I think it should
rather use BiDi controls to have an overall RTL paragraph direction
detected, and within that BiDi controls to force LTR for the text. The
terminal shows it with RTL direction, as I manually set it.

Again, all three of these details are irrelevant to my point, namely
that in WIP gnome-terminal it looks the same as in Emacs.)


You define paragraphs as emptyline-separated blocks on which you
perform autodetection of the paragraph direction. This is great! As
I've mentioned, I'd love to have such a mode in terminals, but it's
subject to underlying improvements, like knowing when a prompt starts
and ends, because prompts also have to be paragraph delimiters. You
convinced me that it's much more important than I thought, thanks a
lot for that! I will try to see if I can push for addressing the
prerequisite issues sooner. Indeed I had to manually set RTL paragraph
direction; with manual LTR or with per-line autodetection (as VTE can
do now) the result would be much worse.


Here's how the story continues. This is where we misunderstood each
other (or at the very least I misunderstood you), although we are
talking about the same thing and doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow
reshuffles its letters. UAX#9 section 3 starts by saying that the
first main phase is separation into "paragraphs". What are those
"paragraphs" that we're talking about _now_?

The thing is, both in Emacs and in my specification, it's a logical
line of the text (that is, delimited by single newlines). In these
steps, when the UBA is run, the paragraph is no longer defined as an
emptyline-delimited segment; it's defined as a line of the text.

To recap: The _paragraph direction_ is determined in Emacs for
emptyline-delimited segments of data, which I honestly find a great
thing, and would love to do in terminals too, alas at this point it's
blocked by some really nontrivial technical issues. But once you have
decided on a direction, each _line_ within that data is passed
separately to the BiDi algorithm to get reshuffled; this is what Emacs
does, this is what my specification says, and this is the right thing.
That is, for this step, the definition of "paragraph", as the BiDi
algorithm uses this term, is a line of the text file. This is where I
thought we had a disagreement, but we don't, we just misunderstood
each other.
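
To make the two-step split concrete, here is a minimal Python sketch. It
is only an illustration under stated assumptions, not what Emacs or VTE
actually run: the function names are made up, and the direction guess is
a simplified "first strong character" check that ignores isolates and
embeddings. The direction is decided once per emptyline-delimited block,
and then every line of that block is handed out separately with that
direction, ready to be passed to the BiDi algorithm as its own
"paragraph".

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Guess a direction from the first strong character (simplified)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default  # no strong character found


def lines_with_direction(text):
    """Return (direction, line) pairs; empty lines only act as separators."""
    result, block = [], []

    def flush():
        # Step 1: one direction for the whole emptyline-delimited block.
        # Step 2: each line of the block is emitted on its own.
        if block:
            direction = first_strong_direction("\n".join(block))
            result.extend((direction, line) for line in block)
            block.clear()

    for line in text.split("\n"):
        if line.strip():
            block.append(line)
        else:
            flush()
    flush()
    return result


# Two blocks: the first starts with Latin text, the second with Hebrew.
sample = "abc \u05d0\u05d1\n\u05d2\u05d3 def\n\n\u05d0\u05d1 abc\nxyz\n"
for direction, line in lines_with_direction(sample):
    print(direction, repr(line))
```

Note that the line "xyz" gets RTL here even though it contains only
Latin letters, because its block starts with a Hebrew letter. With
per-line autodetection it would come out as LTR.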

-----

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is
much better thought of as a "terminator" than a "delimiter". For example,
if you do a "cat file1 file2", you expect file2 to start on its own
line.

Shouldn't this apply to paragraphs, too, especially when BiDi is in
the game? I'd argue that an empty line (double newline) shouldn't be a
delimiter, it should be a terminator for a paragraph. I think "cat
file1 file2" should make sure that the last paragraph of file1 and the
first paragraph of file2 are printed as separate paragraphs
(potentially with different paragraph direction), shouldn't it? I'd
argue that if a text file is formatted like TUTORIAL.he, with empty
lines denoting paragraph boundaries, then it should also end in an
empty line (that is: two newline characters).
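
A small Python sketch of the "cat file1 file2" case, assuming a
hypothetical split_paragraphs() helper that treats empty lines as
paragraph boundaries (the file contents are made up for illustration):

```python
def split_paragraphs(text):
    """Split text into paragraphs (lists of lines) at empty lines."""
    paragraphs, current = [], []
    for line in text.split("\n"):
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


file1 = "last paragraph of file1\n"
file2 = "first paragraph of file2\n"

# Files ending in a single newline: concatenation fuses the two paragraphs.
merged = split_paragraphs(file1 + file2)

# Files ending in an empty line (two newlines): the paragraphs stay apart.
separate = split_paragraphs(file1 + "\n" + file2 + "\n")

print(len(merged), len(separate))
```

In the first case the two lines end up in one paragraph and would share
a single autodetected direction; in the second case each keeps its own.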

-----

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the
BiDi algorithm, we'd still stick to the emptyline-delimited paragraph
definition. This is not what you do, and this is not what I do, but I
mistakenly believed that this is what you did, and I also thought this
was a good idea as a potential extension for the BiDi specs – I no
longer think so. This definition is truly problematic, as I'll show
below.

The BiDi algorithm takes paragraphs of text, shuffles them, and
somewhere in the middle, in cooperation with the caller, cuts them into
lines. It doesn't say a single word about the input potentially being
cut into lines, how it would handle them, how they would interfere
with the line breaks that the caller of the algorithm decides to add
etc. It makes sense: the BiDi algorithm converts a logical text into a
visual one, whereas single newlines within a paragraph would already
be visual elements, so the input string would be a mixture of the two
worlds (which probably doesn't make any sense per se).

Let's assume that the message I want to deliver, written in its
logical order (left to right), is:

abc DEFGHIJKLM NOPQ rstuvwxyz

For whatever reason (e.g. I'd prefer to keep a 15 column margin in the
source file) it's split into two lines, that is, in the middle there's
a newline rather than a space:

abc<space>DEFGHIJKLM<newline>NOPQ<space>rstuvwxyz

A completely non-BiDi application would show the contents as

abc DEFGHIJKLM
NOPQ rstuvwxyz

If you run the BiDi algorithm on this unit as a whole paragraph, it
would not handle newline any differently from a space. It sees one
continuous run of RTL text consisting of two words with a newline in
between, and reverses their order:

abc<space>QPON<newline>MLKJIHGFED<space>rstuvwxyz

Which would show up like this in a proper BiDi-aware viewer:

abc QPON
MLKJIHGFED rstuvwxyz

I can see two significant problems with this.

One is that because it can shuffle characters around the newline, it
breaks the principle that the eyes never have to move upwards.

The second is that the margin of 15 characters is no longer preserved.
The visual character (newline) no longer serves the visual purpose it
served in the logical order. Especially in terminals this could cause
a whole bunch of troubles. E.g. when an application believes that
printing some stuff moved the cursor down by 2 lines, it might have
actually moved it by 3 (if the terminal's overall width is also
15-ish, in this example). It's unclear how cursor positions, mouse
click positions (including on the "unused" area after the end of each
line) could be mapped, and so on. It's such a complex area that I
really wouldn't like to continue in this direction even if it was a
correct one, which luckily it isn't.

(I vaguely recall, from about a decade ago, that – presumably for
reasons along these lines – browsers have a huge problem with "<br>"
inside a paragraph when it comes to BiDi. I don't know where they
stand now, I'll investigate if it's important, but I don't think it
is.)

Luckily both Emacs and my specification shuffle the contents
separately within each line (using LTR paragraph direction for both
lines, as it's guessed from the union of them), resulting in the
desired:

abc MLKJIHGFED
QPON rstuvwxyz
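
The two outcomes can be reproduced with a toy Python model. This is
emphatically not the UBA, just a sketch under the convention of this
example: uppercase letters stand for RTL, lowercase for LTR, everything
else (including the newline, for the sake of the thought experiment) is
neutral, and the paragraph direction is LTR. The name toy_reorder_ltr is
made up.

```python
import re

# An RTL run: uppercase letters, possibly joined by neutrals in between.
RTL_RUN = re.compile(r"[A-Z](?:[^A-Za-z]*[A-Z])*")


def toy_reorder_ltr(text):
    """Reverse every RTL run in place (LTR paragraph direction assumed)."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], text)


logical = "abc DEFGHIJKLM\nNOPQ rstuvwxyz"

# Thought experiment: the newline is treated like a space inside one unit.
whole = toy_reorder_ltr(logical)

# What Emacs and the specification do: each line is reordered on its own.
per_line = "\n".join(toy_reorder_ltr(line) for line in logical.split("\n"))

print(whole)
print([len(line) for line in whole.split("\n")])     # line widths: [8, 20]
print(per_line)
print([len(line) for line in per_line.split("\n")])  # line widths: [14, 14]
```

The widths show the second problem directly: reordering across the
newline produces a 20 column line that no longer fits the 15 column
margin, while per-line reordering keeps both lines at 14 columns.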


Does this all make much more sense now? :)


cheers,
egmont

On Tue, Feb 5, 2019 at 5:09 PM Eli Zaretskii via Unicode
<unicode at unicode.org> wrote:
>
> > Date: Tue, 5 Feb 2019 00:05:47 +0000
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> >
> > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > > by paragraph separator characters. This means characters whose bidi
> > > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> >
> > It actually gives two different definitions. Table UAX#9 4 restricts
> > the type B to *appropriate newline functions; not all newlines are
> > paragraph separators.
>
> For what exactly is "appropriate newline function" one should read the
> Unicode Standard, section 5.8.  My conclusions from that are different
> from yours; see below.
>
> > > Indeed, this was an oversight on my side. So, with this definition,
> > > every single newline character starts a new paragraph. The result of
> > > printf "Hello\nWorld\n" > world.txt
> > > is a text file consisting of two paragraphs, with 5 characters in
> > > each. Correct?
> >
> > No, it depends on when a newline function is 'appropriate'. TUS 5.8
> > Rule R2b applies - 'In simple text editors, interpret any NLF the same
> > as LS'.
>
> That's not all of what the Standard says.  Just a couple of paragraphs
> above Rule R2b, there's this text:
>
>   Note that even if an implementer knows which characters represent
>   NLF on a particular platform, CR, LF, CRLF, and NEL should be
>   treated the same on input and in interpretation. Only on output is
>   it necessary to distinguish between them.
>
> So in practice, IMO the above example does constitute 2 paragraphs,
> regardless of the underlying platform's conventions.


