Bidi paragraph direction in terminal emulators (was: BiDi in terminal emulators)

Eli Zaretskii via Unicode unicode at unicode.org
Mon Feb 4 10:51:12 CST 2019


> From: Egmont Koblinger <egmont at gmail.com>
> Date: Mon, 4 Feb 2019 00:36:23 +0100
> Cc: unicode at unicode.org
> 
> The Unicode BiDi algorithm states that it operates on paragraphs of
> text, and leaves it up to a higher protocol to define what a paragraph
> exactly is.
> 
> What's the definition of "paragraph" in the context of plain text files?
> 
> I don't think there's a single well-established practice.

Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
paragraph separator characters.  This means characters whose bidi
category is B, which includes Newline, the CR-LF pair on Windows,
U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
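As an illustration of that definition, here is a minimal Python sketch
that splits a string at characters whose bidi category is B, using the
standard unicodedata module.  The function name split_paragraphs is
made up for this example, and treating a CR-LF pair as one separator
is an assumption of the sketch, not a quotation of UAX#9.

```python
import unicodedata

def split_paragraphs(text):
    """Split text at characters whose bidi category is B."""
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            # Assumption: a CR immediately followed by LF counts as
            # one separator, so skip the LF here.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    # Keep trailing text that has no final separator.
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```

With this rule, every hard newline ends a paragraph, so each line of a
typical text file becomes a paragraph of its own.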

> In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way
> more complicated, probably there isn't a well-defined grammar for
> how exactly bullet list entries and alike should become new
> paragraphs.

Actually, Emacs implements the rule that paragraphs are separated by
empty lines.  This is documented in the Emacs manuals.  (That's the
default; users and Lisp programs can control it to some extent.)
This rule is global, and is applied to any file or buffer, including
TUTORIAL.he.
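The empty-line rule itself can be sketched in a few lines of Python.
This is only an illustration, not the Emacs implementation; the name
paragraphs_by_empty_lines is hypothetical, and counting
whitespace-only lines as empty is an assumption of the sketch.

```python
def paragraphs_by_empty_lines(text):
    """Group lines into paragraphs separated by empty lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        # Assumption: a line holding only whitespace counts as empty.
        if line.strip() == "":
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    # Flush the last paragraph if the text does not end in an empty line.
    if current:
        paragraphs.append(current)
    return paragraphs
```

Each paragraph comes back as a list of its lines, so a caller can
determine one base direction per paragraph and apply it to every line
in that paragraph.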

> lorem ipsum FED ]> CBA foobar
> 
> The visual representation, in a narrower viewport, might wrap for
> example like this:
> 
> lorem ipsum CBA
> FED ]> foobar

I suggest leaving line wrapping alone for the moment: it is a further
complication.  Let's first talk about text whose every line ends in a
hard newline -- this is what you see in most of the "simple" text-mode
utilities we are talking about.  If/when we solve the problems
there, we can then look at the issues with wrapping.

> Here comes the twist. Let's view this latter file with a viewer that
> uses a _different_ definition for paragraph. Let's view it in Gedit,
> Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
> every newline begins a new paragraph – that's how these viewers define
> the notion of "paragraph" for the sake of BiDi.
> 
> The visual layout in these viewers becomes:
> 
> lorem ipsum CBA
> <[ FED foobar
> 
> which is just not correct. Since here BiDi is run on the two lines
> separately, the initial "<[" is treated as LTR, placed at the wrong
> location in the wrong order, and the glyphs aren't mirrored.

This kind of problem happens all the time, and you cannot avoid it.
Different programs display bidi text differently.  I propose not to
try to solve this problem, because IME it cannot be solved in general.
Let's focus on the terminal emulators that should comply with your
guidelines, and let's try to decide what they should do about the base
paragraph direction of text emitted by "simple" text utilities.
If they all make decisions by the same rule, they will all show the
same text identically.

> Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
> not everywhere) seems to treat runs between empty lines as paragraphs,

Correct.

> while Emacs itself is a viewer that treats runs between single
> newlines as paragraphs. That is, Emacs is inconsistent with itself.

Incorrect.  Emacs always treats a run of text between empty lines as a
single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
special about TUTORIAL.he; it is just a plain text file with a few
dozen bidi formatting controls, needed to show the key sequences
with weak and neutral characters in correct visual order.  (Some of
those controls can probably be removed nowadays, since we now have the
BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was
released.)  In fact, I wrote that tutorial as an exercise, to prove to
myself that Emacs can be useful for editing non-trivial bidi text.

> In case you think I got something wrong with Emacs: Could you please
> give exact definitions:
> - What are the exact units (so-called "paragraphs" by UAX9) that it
> runs BiDi on when it loads and displays a file?

See above: for the purpose of the Emacs UBA implementation, paragraphs
are separated by empty lines.  That is the only rule in Emacs
regarding paragraph determination.

> - What are the exact units (so-called "paragraphs" by UAX9) in
> TUTORIAL.he on which BiDi needs to be run in order to get the desired
> readable version?

The same.  There's nothing special about that file.

> What most likely happens is that in order to see a difference, you'd
> need to have more special symbols, or at least a more special
> constellation of them. Probably TUTORIAL.he is just luckily simple
> enough that such a difference isn't hit.

No, TUTORIAL.he is neither "lucky" nor "simple".  I deliberately used
almost every bidi formatting control there is in that file, where
appropriate, to make sure this stuff works as intended in an otherwise
plain text file.

> Another possibility is (and I cannot check because I can't speak
> Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
> get the desired visual one.

There's no cheating there, I assure you.

> This definition of paragraph (stuff between a newline and the next
> one) is the same as the one of Gedit, Emacs etc. when it comes to
> displaying BiDi text.

At least with Emacs, it is not the same.  I think considering each
line as a separate paragraph makes writing bidi plain-text documents
that look right almost impossible, if each line ends in a newline, as
customary in Emacs (and with "simple" text utilities).
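To make the difference concrete, here is a small Python sketch that
compares the base direction obtained per line with the one obtained
for the whole paragraph.  It uses a simplified first-strong-character
heuristic in the spirit of UAX#9 rules P2 and P3; it does not skip
text inside directional isolates, so take it as an approximation.  The
function name and the sample text are invented for the example.

```python
import unicodedata

def base_direction(text):
    """Return 'RTL' or 'LTR' based on the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "RTL"
        if bidi_class == "L":
            return "LTR"
    # No strong character found: fall back to left-to-right.
    return "LTR"

# One paragraph, hard-wrapped into two lines.  The first line starts
# with Hebrew letters, the second with Latin letters.
lines = ["\u05e9\u05dc\u05d5\u05dd world", "foobar \u05d0\u05d1\u05d2"]

# Each line treated as its own paragraph.
per_line = [base_direction(line) for line in lines]

# All lines up to the next empty line treated as one paragraph.
whole = base_direction("\n".join(lines))
```

Here per_line is ['RTL', 'LTR'] while whole is 'RTL': under the
per-line rule the second line of the same paragraph flips to
left-to-right merely because it happens to begin with a Latin word.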

> Now, it's possible to ponder about other, larger units as possible
> definitions. For certain files, surely the right approach would be to
> treat parts delimited by empty lines as paragraphs. But how far should
> we go? Should terminals understand markdown (one of the most terrible
> grammars I've ever seen) and all its popular flavors? Should it
> understand Emacs's TUTORIAL.he? Should it understand dpkg's format?
> What else?

My personal recommendation is to adopt the empty line rule.  It's
simple enough and gives good results IME.

> There's another conceptual problem here. Most terminal emulators don't
> understand a single bit of what happens inside them. They don't know
> where an application's output begins, where it ends. They don't know
> where the shell prompt is. In fact, they have no idea what a shell
> prompt is. They only see a single stream of incoming data to process
> (print printable characters, and obey to control instructions).
> 
> With the paragraph definition of "between a newline and the next one"
> this is not a problem, everything is doable based on what terminals
> already know.
> 
> With any other definition, e.g. if you define paragraphs as "separated
> by empty lines", still I'm sure you'd need the shell prompt to
> terminate the previous paragraph, start a new one (the prompt's and
> command line's), and even below the command line where the next
> utility's output begins it would also need to start a new paragraph.
> But we just don't have this information now.

I'm surprised that you describe this as such a complex problem.  I
think you explained up-thread that terminal emulators should cope with
lines of text arriving piecemeal, which I interpreted as meaning that
text is stored in the emulator's memory.  Modern emulators running on
windowed desktops also provide scroll-back buffers, and react to
expose events.  So I think the text that is currently in the viewport,
and also some text previously shown, are stored in memory, and can be
consulted.

However, I'm not an expert on this, so I will take your word that this
is a significant complication.  My point is that this is a
complication that must be solved; it cannot be ignored.  If you ignore
it and go for the "each line is a paragraph" rule, you will lose many
users; you will lose me for sure.
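If the emulator does keep its rows in memory, finding the paragraph
around a given row under the empty-line rule could look roughly like
the Python sketch below.  The list-of-strings model of the screen and
the name paragraph_extent are assumptions made for the illustration;
a real emulator's data structures will differ, and the sketch does
nothing about telling a shell prompt apart from program output.

```python
def paragraph_extent(rows, index):
    """Return (start, end), inclusive, of the paragraph around rows[index].

    rows is a list of strings, one per stored line (hypothetical model
    of viewport plus scroll-back).  Returns None for an empty row.
    """
    if rows[index].strip() == "":
        return None
    # Walk upwards until the previous row is empty or we hit the top.
    start = index
    while start > 0 and rows[start - 1].strip() != "":
        start -= 1
    # Walk downwards until the next row is empty or we hit the bottom.
    end = index
    while end + 1 < len(rows) and rows[end + 1].strip() != "":
        end += 1
    return (start, end)
```

When a new row arrives, only the paragraph containing it needs to be
looked at again; rows beyond the nearest empty lines are unaffected.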

> This is why the only reasonable thing I can imagine is to define
> paragraph as newline-delimited segments, and leave it up for future
> enhancements to introduce other "paragraph" definitions as further
> options.

IME, this is a grave mistake.  I hope I explained why; it is now up to
you to decide what to do about that.

