UAX #9: applicability of higher-level protocols to bidi plaintext
Ken Whistler via Unicode
unicode at unicode.org
Thu Jul 19 20:10:49 CDT 2018
On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote:
> If I cannot trust that
> people I communicate with make the same choices I make, plain text
> cannot be used.
Here is a counterexample. The following is a chunk of plain text output
from the bidi reference implementation:
Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
Position: 0 1 2 3 4 5 6 7 8 9 10
11 12
Text: 05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061
2069 0061
Bidi_Class: R RLI L LRI L RLE L PDF L PDI L
PDI L
Levels: 1 1 3 3 4 x 5 x 4 3 3
1 1
Runs: <R-----R> <R-----L> <L-----R> <R-----R> <RL> <L-----R>
<R-----R>
Seqs (L= 1):
<R------[..............................................]------R>
Seqs (L= 3): <R------[..........................]------R>
Seqs (L= 4): <L-----R> <RL>
Seqs (L= 5): <R-----R>
If I just let that default to browser output choices (and assuming you
read your email with a proportional display font), it becomes almost
incomprehensible for casual reading, because the output has an
underlying assumption that there is column alignment across lines, which
in turn depends on a user choice of a fixed-width font for display.
Rectifying that, the reader would then see:
Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
Position:     0    1    2    3    4    5    6    7    8    9    10   11   12
Text:         05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 2069 0061
Bidi_Class:   R    RLI  L    LRI  L    RLE  L    PDF  L    PDI  L    PDI  L
Levels:       1    1    3    3    4    x    5    x    4    3    3    1    1
Runs:         <R-----R> <R-----L> <L-----R> <R-----R> <RL> <L-----R> <R-----R>
Seqs (L= 1):  <R------[..............................................]------R>
Seqs (L= 3):            <R------[..........................]------R>
Seqs (L= 4):                      <L-----R>           <RL>
Seqs (L= 5):                                <R-----R>
where now everything makes sense. (Well, at least if the UBA internals
are your thing!)
It isn't that "plain text cannot be used" to convey this content. The
content is certainly "legible" in the minimal sense required by the
Unicode Standard, and it is interchangeable without data corruption. The
problem is that for optimal display and interpretation as intended, I
also need to convey (and/or have the reader guess) the higher-level
protocol requirement that this particular plain text needs to be
displayed with a monowidth font.
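To make the dependency concrete, here is a minimal sketch of how such columns are typically produced: each cell is padded to a fixed number of characters. The helper name `format_rows` and the widths 14 and 5 are assumptions chosen for illustration, not taken from the reference implementation. Padding by character count only lines up on screen when every character has the same advance width, which is exactly the out-of-band display requirement described above.

```python
def format_rows(rows: dict[str, list[str]],
                label_width: int = 14,
                cell_width: int = 5) -> list[str]:
    """Pad each label and cell to a fixed character count (hypothetical helper)."""
    lines = []
    for label, cells in rows.items():
        # Every cell occupies exactly cell_width characters, so column k
        # starts at the same character offset on every line.
        padded = "".join(cell.ljust(cell_width) for cell in cells)
        lines.append((label + ":").ljust(label_width) + padded.rstrip())
    return lines


# Three rows of the trace shown earlier in this message.
rows = {
    "Text": "05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 2069 0061".split(),
    "Bidi_Class": "R RLI L LRI L RLE L PDF L PDI L PDI L".split(),
    "Levels": "1 1 3 3 4 x 5 x 4 3 3 1 1".split(),
}

for line in format_rows(rows):
    print(line)
```

The character offsets are identical on every line, but nothing in the text itself tells the receiving end to honor them with a fixed-width font.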
> If the Unicode standard does not impose a
> universal default, it does not define interchangeable plain text.
And that is simply not the case. If your text is <a, b, c, !> (<L, L, L,
ON>), that will display as {abc!} in a LTR paragraph directional context
and as {!abc} in a RTL paragraph directional context. Reliably. It isn't
that we don't have interchangeable plain text. We do. What you cannot do
is predict exactly how that text will *display*, if you haven't agreed
with your interlocutor about paragraph direction. But substantively,
that is no different than the proportional versus monowidth font example
I just gave.
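The <a, b, c, !> case can be checked with a small sketch. This is a deliberately reduced subset of the UBA, assuming text that contains only strong L, strong R, and neutral (ON, WS) characters, with no numbers and no explicit embedding or isolate controls. The function name `display_order` is hypothetical, and this is not the reference implementation; it applies only rules N1/N2 (neutrals), I1/I2 (implicit levels), and L2 (reordering).

```python
import unicodedata

NEUTRAL = {"ON", "WS"}


def display_order(text: str, paragraph_rtl: bool) -> str:
    """Return text in visual order, read left to right on screen (sketch only)."""
    base = 1 if paragraph_rtl else 0
    embed = "R" if paragraph_rtl else "L"
    classes = [unicodedata.bidirectional(ch) for ch in text]
    for ch, cls in zip(text, classes):
        if cls not in NEUTRAL and cls not in ("L", "R"):
            raise ValueError(f"unsupported Bidi_Class {cls} for {ch!r}")

    # N1/N2: a run of neutrals takes the direction of its neighbors when they
    # agree, else the embedding direction. Text boundaries (sos/eos) count as
    # strong characters of the paragraph direction.
    n = len(classes)
    resolved = classes[:]
    i = 0
    while i < n:
        if classes[i] in NEUTRAL:
            j = i
            while j < n and classes[j] in NEUTRAL:
                j += 1
            before = classes[i - 1] if i > 0 else embed
            after = classes[j] if j < n else embed
            direction = before if before == after else embed
            for k in range(i, j):
                resolved[k] = direction
            i = j
        else:
            i += 1

    # I1/I2: implicit levels for L and R relative to the paragraph level.
    if base == 0:
        levels = [0 if cls == "L" else 1 for cls in resolved]
    else:
        levels = [1 if cls == "R" else 2 for cls in resolved]

    # L2: from the highest level down to the lowest odd level, reverse each
    # contiguous run of characters at that level or higher.
    chars = list(text)
    highest = max(levels, default=base)
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < n:
            if levels[i] >= level:
                j = i
                while j < n and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                levels[i:j] = reversed(levels[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


print(display_order("abc!", paragraph_rtl=False))  # abc!
print(display_order("abc!", paragraph_rtl=True))   # !abc
```

The same four code points go in both times; only the paragraph direction, supplied from outside the text, differs.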
So I think this still really boils down to the putative requirement that
for something like "Hello, world!", bidi is just too weird, and that
somehow plain text shouldn't be allowed to behave that way. In other
words, if plain text doesn't itself carry and enforce the rules for how
it must be displayed, well, then it isn't really interchangeable.
But that isn't what the Unicode Standard means by plain text. And it
isn't what the standard requires for interchangeability of plain text.
(And yes, bidi is weird!)
>
> My main point, whose rejection baffles me to no end, is that it should.
Well, I'm not expecting that I can make you feel good about the
situation. ;-) But perhaps the UTC position will seem a little less
baffling.
--Ken