Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

Egmont Koblinger via Unicode unicode at unicode.org
Mon Feb 4 17:08:10 CST 2019


Hi Eli,

> Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
> paragraph separator characters.  This means characters whose bidi
> category is B, which includes Newline, the CR-LF pair on Windows,
> U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

Indeed, this was an oversight on my side. So, with this definition,
every single newline character starts a new paragraph. The result of
printf "Hello\nWorld\n" > world.txt
is a text file consisting of two paragraphs, with 5 characters in each. Correct?

> Actually, Emacs implements the rule that paragraphs are separated by
> empty lines.  This is documented in the Emacs manuals.

That is, Emacs overrides UAX#9 and comes up with a different
definition? Furthermore, you argue that in terminals I should follow
Emacs's definition rather than Unicode's? Or please clarify if I
misunderstood you here.

> > while Emacs itself is a viewer that treats runs between single
> > newlines as paragraphs. That is, Emacs is inconsistent with itself.
>
> Incorrect.  Emacs always treats a run of text between empty lines as a
> single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
> special about TUTORIAL.he, it is just a plain text file with a few
> dozen of bidi formatting controls, needed to show the key sequences
> with weak and neutral characters in correct visual order.  [...]

Thanks for the clarification, I believe it's clear to me now.

> At least with Emacs, it is not the same.  I think considering each
> line as a separate paragraph makes writing bidi plain-text documents
> that look right almost impossible, if each line ends in a newline [...]

> My personal recommendation is to adopt theempty line rule.  It's
> simple enough and gives good results IME. [...]

> I'm surprised that you describe this as such a complex problem.  I
> think you explained up-thread that terminal emulators should cope with
> lines of text arriving piecemeal, which I interpreted as meaning that
> text is stored in the emulator's memory.  Modern emulators running on
> windowed desktops also provide scroll-back buffers, and react to
> expose events.  So I think the text that is currently in the viewport,
> and also some text previously shown, are stored in memory, and can be
> consulted.

The problem is not the memory management.

Let's look at the following session:

---snip---
prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$
---snip---

If you load the files to Emacs, it is perfectly aware of the contents
of the two files. It can define paragraphs however it wants to, and
BiDi the files accordingly.

The terminal emulator doesn't know what's a shell prompt, what's a
command that the user types, what's the output of that command. (You
don't know either from this snippet. Maybe I only cat'ed file1.txt,
and "prompt$ cat file2.txt" is just the sixth line of this eleven-line
file.)

In the terminal emulator's eyes, with Emacs's definition (empty line
delimited), this is one paragraph:

prompt$ cat file1.txt
This is the
first human-perceived paragraph.

and this is another paragraph:

And this is the
second
prompt$ cat file2.txt
Here this is the
third paragraph.

and similarly for the third one.

I believe I understand your concerns with the per-line paragraph
definition, but this interpretation that I've just shown most likely
leads to even more broken behavior.

It's a really nontrivial technical problem to let the terminal
emulator know where each prompt, and/or each command's output begins
and ends. There's work going on for letting the terminal emulator
recognize the prompts, but even if it's successful, it'll probably
take 5-10 years to reach the majority of the users. And it probably
still wouldn't solve the case of knowing the boundary between the two
outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
they're concatenated with "cat file1.txt file2.txt".

So, what you're arguing for, is that the default behavior should be
something that's:
- currently not implementable in a semantically correct way (to stop
around shell prompts) due to technical limitations, and
- isn't what Unicode says.

You have not convinced me that the pros outweigh the cons. That being
said, I'm more than open to see such a behavior as a future extension,
subject of course to the semantic prompt stuff being available.


cheers,
egmont


More information about the Unicode mailing list