Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

Egmont Koblinger via Unicode unicode at unicode.org
Sun Feb 3 17:36:23 CST 2019


Hi Eli,

(I'm responding in multiple emails.)


The Unicode BiDi algorithm states that it operates on paragraphs of
text, and leaves it up to a higher protocol to define what a paragraph
exactly is.

What's the definition of "paragraph" in the context of plain text files?

I don't think there's a single well-established practice. In some
particular text files, every explicit newline character starts a new
paragraph. In some (e.g. COPYING.GPL and friends), an empty line (that
is: two consecutive newline characters) separates two paragraphs. In
some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way more
complicated; there probably isn't a well-defined grammar for how
exactly bullet list entries and the like should become new paragraphs.
In the output of "dpkg -s packagename", consecutive lines indented by 1
space – except for those where there's only a single dot after the
space – form the human-perceived paragraphs. There are surely several
other syntaxes out there.
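
To make the competing definitions concrete, here is a minimal Python
sketch of three of the splitting rules mentioned above. The function
names are made up for illustration, and the dpkg rule is a simplified
reading of the description in this email (lines indented by one space
are joined, a lone " ." separates paragraphs), not a full parser for
that format.

```python
import re


def paragraphs_per_newline(text):
    """Every explicit newline starts a new paragraph."""
    return [line for line in text.split("\n") if line]


def paragraphs_per_blank_line(text):
    """An empty line separates paragraphs; a single newline acts as a space."""
    blocks = re.split(r"\n[ \t]*\n", text.strip("\n"))
    return [" ".join(block.split("\n")) for block in blocks if block]


def paragraphs_dpkg_style(text):
    """Simplified dpkg-like rule (an assumption, not the real format spec):
    consecutive lines indented by one space form a paragraph, and a line
    consisting of " ." ends the current paragraph."""
    paragraphs, current = [], []
    for line in text.split("\n"):
        if line.startswith(" ") and line != " .":
            current.append(line[1:])
        else:
            # A separator or a non-indented line closes the open paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = "first line\nsecond line\n\nnext block"
print(paragraphs_per_newline(sample))
print(paragraphs_per_blank_line(sample))
```

The same bytes yield three paragraphs under the first rule and two
under the second, which is exactly the kind of disagreement between
producer and viewer discussed below.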

If the producer of a text file uses a different definition than the
viewer software, bugs can arise. I think this should be intuitively
obvious, but just in case, let me give a concrete example. In this
example I'll assume LTR paragraph direction set up by some external
means; with autodetected paragraph direction it's much easier to come
up with such breakages.


I wish to store and deliver the following text, as it's laid out here
in logical order. That is, the order as the bytes appear in the text
file, as I typed them from the keyboard, is laid out here strictly
from left to right, with uppercase standing for RTL letters, and no
mirroring:

lorem ipsum ABC <[ DEF foobar

The visual representation, what I expect to see in any decent viewer
software according to the BiDi algorithm, is this:

lorem ipsum FED ]> CBA foobar

The visual representation, in a narrower viewport, might wrap for
example like this:

lorem ipsum CBA
FED ]> foobar

which is still correct, given that logical "ABC <[ DEF" is a single
RTL run. (This assumes a viewer which, unlike Emacs, follows the
Unicode BiDi algorithm for wrapping a paragraph into multiple lines.)


Let's assume that I, as the producer of the text file, wish to create
a typical README in the spirit of COPYING.GPL and similar text files,
with the paragraph definition that two consecutive newline characters
(that is: a single empty line) delimit paragraphs; and a single
newline is equivalent to a space. Since I'd prefer to keep a margin of
16 characters in the source file (for demo purposes), I can take the
liberty of replacing the space after "ABC" by a single newline. (Maybe
my text editor does this automatically.) The file's contents, again
the logical order laid out from left to right, top to bottom, becomes
this:

lorem ipsum ABC
<[ DEF foobar

This file, according to the paragraph definition chosen earlier, is
equivalent to the unwrapped version shown before, and thus should
convey the same message.

If I view this file in a piece of software which uses the same
paragraph definition for BiDi purposes, the contents will appear as
expected. An example of such a viewer is a browser displaying the
HTML output of a markdown converter (one that leaves single newlines
as-is, and adds a "<p>" at double newlines).


Here comes the twist. Let's view this latter file with a viewer that
uses a _different_ definition for paragraph. Let's view it in Gedit,
Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
every newline begins a new paragraph – that's how these viewers define
the notion of "paragraph" for the sake of BiDi.

The visual layout in these viewers becomes:

lorem ipsum CBA
<[ FED foobar

which is just not correct. Since here BiDi is run on the two lines
separately, the initial "<[" is treated as LTR, placed at the wrong
location in the wrong order, and the glyphs aren't mirrored.
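
The difference between the two viewers can be reproduced with a toy
model in Python. This is a sketch under loud assumptions, not an
implementation of UAX #9: it assumes LTR paragraph direction, treats
uppercase ASCII letters as RTL and lowercase as LTR (the convention
used in this email), treats everything else (digits included) as
neutral, and implements only a simplified form of neutral resolution
and run reversal. All names are hypothetical.

```python
# Characters that get mirrored inside an RTL run (tiny illustrative subset).
MIRROR = {"<": ">", ">": "<", "[": "]", "]": "[", "(": ")", ")": "("}


def char_class(ch):
    """Toy classification: uppercase = RTL, lowercase = LTR, rest neutral."""
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"


def resolve_levels(paragraph):
    """Level 1 for RTL letters, 0 for LTR letters. A run of neutrals gets
    level 1 only if it has RTL letters on both sides; otherwise it takes
    the (assumed LTR) paragraph level 0."""
    classes = [char_class(c) for c in paragraph]
    levels, i, n = [], 0, len(classes)
    while i < n:
        if classes[i] != "N":
            levels.append(1 if classes[i] == "R" else 0)
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else "L"  # paragraph start counts as LTR
        after = classes[j] if j < n else "L"       # paragraph end counts as LTR
        levels.extend([1 if before == after == "R" else 0] * (j - i))
        i = j
    return levels


def reorder_line(text, levels):
    """Reverse and mirror each maximal run of level-1 characters."""
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and levels[j] == levels[i]:
            j += 1
        run = text[i:j]
        if levels[i] == 1:
            run = "".join(MIRROR.get(c, c) for c in reversed(run))
        out.append(run)
        i = j
    return "".join(out)


def render_paragraph(paragraph):
    """Resolve levels over the whole paragraph (newlines are soft breaks),
    then reorder every line on its own."""
    levels = resolve_levels(paragraph)
    visual, pos = [], 0
    for line in paragraph.split("\n"):
        visual.append(reorder_line(line, levels[pos:pos + len(line)]))
        pos += len(line) + 1  # skip the newline itself
    return visual


text = "lorem ipsum ABC\n<[ DEF foobar"
# Viewer A: the two lines belong to one paragraph.
print(render_paragraph(text))    # ['lorem ipsum CBA', 'FED ]> foobar']
# Viewer B: every newline starts a new paragraph.
print([render_paragraph(line)[0] for line in text.split("\n")])
# ['lorem ipsum CBA', '<[ FED foobar']
```

In viewer B the "<[ " at the start of the second line has no RTL
character before it any more, so it falls back to the paragraph
direction and stays where it is, unmirrored.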


Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
not everywhere) seems to treat runs between empty lines as paragraphs,
while Emacs itself is a viewer that treats runs between single
newlines as paragraphs. That is, Emacs is inconsistent with itself.

In case you think I got something wrong with Emacs: Could you please
give exact definitions:
- What are the exact units (so-called "paragraphs" by UAX9) that it
runs BiDi on when it loads and displays a file?
- What are the exact units (so-called "paragraphs" by UAX9) in
TUTORIAL.he on which BiDi needs to be run in order to get the desired
readable version?

What most likely happens is that in order to see a difference, you'd
need to have more special symbols, or at least a more special
constellation of them. Probably TUTORIAL.he is just luckily simple
enough that such a difference isn't hit.

Another possibility is (and I cannot check because I can't speak
Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
get the desired visual one.

-----

Now, back to terminals.

The smallest possible viable definition of a "paragraph" in terminal
emulators is stuff between one newline and the next one.

It would require a hell of a lot of work, redesigning
(overcomplicating) plenty of the basics of terminal emulation, to be
able to come up with smaller units, e.g. cells of a table – a concept
that doesn't currently exist in this world. I don't find any such
approach feasible at all.

This definition of paragraph (stuff between a newline and the next
one) is the same as the one of Gedit, Emacs etc. when it comes to
displaying BiDi text.

Now, it's possible to ponder about other, larger units as possible
definitions. For certain files, surely the right approach would be to
treat parts delimited by empty lines as paragraphs. But how far should
we go? Should terminals understand markdown (one of the most terrible
grammars I've ever seen) and all its popular flavors? Should they
understand Emacs's TUTORIAL.he? Should they understand dpkg's format?
What else?

There's another conceptual problem here. Most terminal emulators don't
understand a single bit of what happens inside them. They don't know
where an application's output begins, where it ends. They don't know
where the shell prompt is. In fact, they have no idea what a shell
prompt is. They only see a single stream of incoming data to process
(print printable characters, and obey control instructions).

With the paragraph definition of "between a newline and the next one"
this is not a problem, everything is doable based on what terminals
already know.

With any other definition, e.g. if you define paragraphs as "separated
by empty lines", I'm sure you'd still need the shell prompt to
terminate the previous paragraph and start a new one (the prompt's and
command line's), and a new paragraph would also need to start below
the command line, where the next utility's output begins. But we just
don't have this information now.

There are extensions used by some terminal emulators, and perhaps
they'll get "standardized" and more widely adopted, to at least let
the terminal emulator know where the shell prompt and command line
begin and end. But even if they're adopted by many emulators, there's
still a problem: are the shells (binaries) going to emit these
themselves, or should the user configure the prompt to contain them?
It's quite unlikely that we'll have buy-in from all the popular
shells. The prompts are maintained by the users themselves, with
.bashrc or so defining them; this file is copied over from /etc/skel
once and then cannot be updated by distributions. Even if it's going
to happen, it'll take many, many years until we can safely rely on
this information being generally available.

For the problem of having the same paragraph direction for multiple
paragraphs (e.g. an entire file, as cat'ed), we're also hit by this
limitation. Once the knowledge of where a command's output begins and
ends becomes available, we'll be able to do this: for example, say
that the direction is autodetected on the command's output as one
unit, but then BiDi is applied on each line or each
emptyline-delimited fragment. We just don't have the necessary
information now, and won't have for a looong time.
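
As a rough illustration of the two options, the following Python
sketch autodetects direction with a first-strong-character heuristic
in the spirit of rules P2/P3 of UAX #9, once per line and once for a
command's output as a whole. It assumes the output is already
available as one string, which is precisely the information a terminal
doesn't have today. It also ignores isolates and other details of the
real rules, and the function names are hypothetical.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Return "rtl" or "ltr" based on the first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return default  # no strong character found


def directions_per_line(output):
    """Autodetect each newline-delimited paragraph on its own."""
    return [first_strong_direction(line) for line in output.split("\n")]


def directions_whole_output(output):
    """Autodetect once on the whole output, then use it for every line."""
    direction = first_strong_direction(output)
    return [direction for _ in output.split("\n")]


# Three lines: Hebrew first, Latin first, digits only.
output = "\u05d0\u05d1\u05d2 abc\nabc \u05d0\u05d1\u05d2\n123"
print(directions_per_line(output))      # ['rtl', 'ltr', 'ltr']
print(directions_whole_output(output))  # ['rtl', 'rtl', 'rtl']
```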

This is why the only reasonable thing I can imagine is to define
paragraphs as newline-delimited segments, and leave it to future
enhancements to introduce other "paragraph" definitions as further
options.


cheers,
egmont


