UAX #9: applicability of higher-level protocols to bidi plaintext

Ken Whistler via Unicode unicode at unicode.org
Thu Jul 19 20:10:49 CDT 2018



On 7/19/2018 12:38 AM, Shai Berger via Unicode wrote:
> If I cannot trust that
> people I communicate with make the same choices I make, plain text
> cannot be used.

Here is a counterexample. The following is a chunk of plain text output 
from the bidi reference implementation:

Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
   Position:       0    1    2    3    4    5    6    7    8    9 10   
11   12
   Text:        05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 
2069 0061
   Bidi_Class:     R  RLI    L  LRI    L  RLE    L  PDF    L  PDI L  
PDI    L
   Levels:         1    1    3    3    4    x    5    x    4    3 3    
1    1
   Runs:        <R-----R> <R-----L> <L-----R> <R-----R> <RL> <L-----R> 
<R-----R>
   Seqs (L= 1): 
<R------[..............................................]------R>
   Seqs (L= 3): <R------[..........................]------R>
   Seqs (L= 4):                     <L-----R> <RL>
   Seqs (L= 5):                               <R-----R>

If I just let that default to browser output choices (and assuming you 
read your email with a proportional display font), it becomes almost 
incomprehensible for casual reading, because the output has an 
underlying assumption that there is column alignment across lines, which 
in turn depends on a user choice of a fixed-width font for display. 
Rectifying that, the reader would then see:

Trace: Entering br_UBA_IdentifyIsolatingRunSequences [X10]
Current State: 6
   Position:       0    1    2    3    4    5    6    7    8    9   10   11   12
   Text:        05D0 2067 0061 2066 0061 202B 0061 202C 0061 2069 0061 2069 0061
   Bidi_Class:     R  RLI    L  LRI    L  RLE    L  PDF    L  PDI    L  PDI    L
   Levels:         1    1    3    3    4    x    5    x    4    3    3    1    1
   Runs:        <R-----R> <R-----L> <L-----R> <R-----R> <RL> <L-----R> <R-----R>
   Seqs (L= 1): <R------[..............................................]------R>
   Seqs (L= 3):           <R------[..........................]------R>
   Seqs (L= 4):                     <L-----R>           <RL>
   Seqs (L= 5):                               <R-----R>

where now everything makes sense. (Well, at least if the UBA internals 
are your thing!)

It isn't that "plain text cannot be used" to convey this content. The 
content is certainly "legible" in the minimal sense required by the 
Unicode Standard, and it is interchangeable without data corruption. The 
problem is that for optimal display and interpretation as intended, I 
also need to convey (and/or have the reader guess) the higher-level 
protocol requirement that this particular plain text needs to be 
displayed with a monowidth font.
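
As a rough illustration of that dependency, here is a small Python sketch
that rebuilds the Position, Text and Bidi_Class rows of the trace from the
code points. It is not the reference implementation's code: the 15-character
label area and the 5-character fields are assumptions inferred from the trace
itself, and trace_rows is a hypothetical helper name. It relies only on the
standard unicodedata module.

```python
import unicodedata

# Code points copied from the "Text:" row of the trace.
CODE_POINTS = [0x05D0, 0x2067, 0x0061, 0x2066, 0x0061, 0x202B, 0x0061,
               0x202C, 0x0061, 0x2069, 0x0061, 0x2069, 0x0061]

LABEL_WIDTH = 12  # assumed: label is padded to 12 after a 3-space indent
FIELD_WIDTH = 5   # assumed: every value is right-aligned in 5 columns


def row(label, cells):
    """Format one trace row using space padding only."""
    padded = "".join(f"{cell:>{FIELD_WIDTH}}" for cell in cells)
    return f"   {label:<{LABEL_WIDTH}}" + padded


def trace_rows(code_points):
    """Return the Position, Text and Bidi_Class rows for the code points."""
    positions = [str(i) for i in range(len(code_points))]
    text = [f"{cp:04X}" for cp in code_points]
    classes = [unicodedata.bidirectional(chr(cp)) for cp in code_points]
    return [row("Position:", positions),
            row("Text:", text),
            row("Bidi_Class:", classes)]


if __name__ == "__main__":
    for line in trace_rows(CODE_POINTS):
        print(line)
```

The alignment here is carried entirely by runs of spaces. The columns line
up only if every character is displayed at the same width, which is exactly
the assumption that the plain text itself has no way to state.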

> If the Unicode standard does not impose a
> universal default, it does not define interchangeable plain text.

And that is simply not the case. If your text is <a, b, c, !> (<L, L, L, 
ON>), that will display as {abc!} in a LTR paragraph directional context 
and as {!abc} in a RTL paragraph directional context. Reliably. It isn't 
that we don't have interchangeable plain text. We do. What you cannot do 
is predict exactly how that text will *display*, if you haven't agreed 
with your interlocutor about paragraph direction. But substantively, 
that is no different than the proportional versus monowidth font example 
I just gave.
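
To make the {abc!} versus {!abc} case concrete, here is a minimal Python
sketch of a deliberately tiny subset of the UBA. It assumes a single line
containing only strong L/R characters and ON/WS neutrals, with no explicit
embeddings, isolates, numbers, or mirroring, so only rules N1, N2, I1, I2
and L2 come into play. The function name toy_display_order is hypothetical,
and this is in no way a substitute for a conformant implementation.

```python
import unicodedata

STRONG = {"L", "R"}
NEUTRAL = {"ON", "WS"}


def toy_display_order(text, paragraph_level):
    """Visual order of `text` for paragraph level 0 (LTR) or 1 (RTL).

    Toy subset of UAX #9: strong L/R plus ON/WS neutrals only.
    """
    if paragraph_level not in (0, 1):
        raise ValueError("paragraph_level must be 0 or 1")
    classes = [unicodedata.bidirectional(ch) for ch in text]
    for cls in classes:
        if cls not in STRONG | NEUTRAL:
            raise ValueError(f"Bidi_Class {cls!r} is outside this toy subset")

    base = "R" if paragraph_level else "L"
    n = len(classes)

    # N1/N2: a run of neutrals takes the direction of its neighbours when
    # they agree (sos/eos count as the paragraph direction here), otherwise
    # the embedding direction, which in this subset is the paragraph's.
    resolved = list(classes)
    i = 0
    while i < n:
        if classes[i] in NEUTRAL:
            j = i
            while j < n and classes[j] in NEUTRAL:
                j += 1
            before = classes[i - 1] if i > 0 else base
            after = classes[j] if j < n else base
            direction = before if before == after else base
            resolved[i:j] = [direction] * (j - i)
            i = j
        else:
            i += 1

    # I1/I2: a character running against the paragraph direction goes up
    # one level; everything else stays at the paragraph level.
    levels = [paragraph_level if d == base else paragraph_level + 1
              for d in resolved]

    # L2: from the highest level down to the lowest odd level, reverse each
    # contiguous run of characters at that level or higher.
    chars = list(text)
    odd_levels = [lv for lv in levels if lv % 2]
    if odd_levels:
        for level in range(max(levels), min(odd_levels) - 1, -1):
            i = 0
            while i < n:
                if levels[i] >= level:
                    j = i
                    while j < n and levels[j] >= level:
                        j += 1
                    chars[i:j] = chars[i:j][::-1]
                    i = j
                else:
                    i += 1
    return "".join(chars)


if __name__ == "__main__":
    print(toy_display_order("abc!", 0))  # abc!
    print(toy_display_order("abc!", 1))  # !abc
```

The stored text is identical in both calls. The only input that changes is
the paragraph level, which is the piece of information that has to be agreed
on or supplied from outside the text.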

So I think this still really boils down to the putative requirement that 
for something like "Hello, world!", bidi is just too weird, and that 
somehow plain text shouldn't be allowed to behave that way. In other
words, if plain text doesn't carry with it a binding requirement for how
it must be displayed, well, then it isn't really interchangeable.

But that isn't what the Unicode Standard means by plain text, and it
isn't what the standard requires for interchangeability of plain text.
(And yes, bidi is weird!)

>
> My main point, whose rejection baffles me to no end, is that it should.

Well, I'm not expecting that I can make you feel good about the 
situation. ;-) But perhaps the UTC position will seem a little less 
baffling.

--Ken
