Fault in Bidi Algorithm at BD16

Richard Wordingham richard.wordingham at ntlworld.com
Sun Mar 20 15:43:38 CDT 2022


On Sun, 20 Mar 2022 12:49:19 -0700
Ken Whistler via Unicode <unicode at corp.unicode.org> wrote:

> Richard,
> 
> On 3/20/2022 10:58 AM, Richard Wordingham via Unicode wrote:
> > 2.  Compare the closing paired bracket being inspected or its
> >      canonical equivalent to the bracket in the current stack
> > element."
> >
> > It was picked up by line 312 of BidiCharacterTests.txt:
> >
> > 0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6
> >
> > This line primarily checks that U+2329 and U+3009 are identified as
> > a 'bracket pair'.  bpb(U+2329) is U+232A, whose canonical
> > decomposition is U+3009.  However, the step *numbered* '2' is
> > non-deterministic; it contains the word 'or'.
> 
> I'm not seeing it. The inclusion of an "or" there does not make this 
> non-deterministic.

"Do A or B" is not deterministic.  In general, there may be several
different ways of achieving the same effect.

> Yes, the text is not pedantically precise, I suppose, but most people 
> have not had trouble interpreting what is intended. If your candidate 
> closing bracket (or the canonical equivalent of your candidate
> closing bracket) matches the closing bracket match mapping detailed
> in BidiBrackets.txt for the opening bracket candidate on the stack,
> then you have a bracket match.

How do you collect the statistics?  I would have thought you would have
been unlikely to know about such matters, for the errors should get
caught by the conformance tests.  At that point the penny drops.  And
with English, one needs to be careful with connectives like 'or'; it
seems clear to me that not even all native speakers interpret such
combinations the same way.

By the time one gets to N0, the intelligibility of the UBA is rapidly
falling off.  (I'm not confident that that's curable.)  And we know that
people do code up Unicode algorithms without understanding them.  The
UBA is one of the more complex algorithms, which is probably why it has
such a large set of tests.  The complexity has led to at least one
author leaving a curse in his public code.

> This affects precisely 2329 and 232A because those are the *only* 
> brackets listed in BidiBrackets.txt that have canonical decomposition 
> mappings. And it is vanishingly unlikely that the UTC is ever going
> to add more such paired brackets with canonical decomposition
> mappings.
> 
> >   The simple, robust solution is to change 'or
> > its canonical equivalent' to 'and its canonical equivalents'.  
> I don't think that actually would clarify the text. And we shouldn't 
> imply more of a requirement to import normalization into UBA than is 
> actually needed.
> >   That
> > also avoids the risk of 'its canonical equivalent' being
> > interpreted as the result of the function to_NFC or to_NFD.  
> 
> I don't see the distinction here. The NFC *and* NFD form of 2329 are 
> both 3008. The NFC *and* NFD form of 232A are both 3009. You could
> use either of those and still end up with the right result for the
> bracket match. But why bother?

U+232A is canonically equivalent to U+3009, but is neither
to_NFC(U+3009) nor to_NFD(U+3009).  Thus, it's not immediately obvious
that the 'canonical equivalent of U+3009' means U+232A.
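
A minimal sketch of that point, assuming Java's java.text.Normalizer
stands in for to_NFC and to_NFD.  The class and method names here are
made up for illustration and come from no specification:

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class AngleBracketForms {
    // to_NFC of a single code point, returned as a string.
    static String nfc(int cp) {
        return Normalizer.normalize(
            new String(Character.toChars(cp)), Normalizer.Form.NFC);
    }

    // to_NFD of a single code point, returned as a string.
    static String nfd(int cp) {
        return Normalizer.normalize(
            new String(Character.toChars(cp)), Normalizer.Form.NFD);
    }

    // Two code points are canonically equivalent when their NFD forms agree.
    static boolean canonicallyEquivalent(int a, int b) {
        return nfd(a).equals(nfd(b));
    }
}
```

Under this sketch, nfc(0x3009) and nfd(0x3009) both return U+3009, so
neither function ever yields U+232A, even though
canonicallyEquivalent(0x232A, 0x3009) holds.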

> The BidiReference code just does a hard-coded additional test (and 
> explains why). For this particular edge case, that works just as
> well, is just as robust (see above assertion that UTC isn't going to
> add more exceptions to be dealt with), and would be *faster* than
> introducing a step to normalize the brackets:
> 
>          if ( ( bracketData.bracket == closingcp ) ||
>               ( ( bracketData.bracket == 0x232A ) && ( closingcp == 0x3009 ) ) ||
>               ( ( bracketData.bracket == 0x3009 ) && ( closingcp == 0x232A ) ) )
> 
> Note the logical OR's there. If condition_a OR condition_b OR 
> condition_c then you have a match. That is completely deterministic
> in this case.
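
For comparison, the quoted condition can be put in a self-contained
method next to a normalization-based variant.  This is only a sketch,
not the BidiReference source: the class name, the method names and the
parameter expectedClosing (the closing bracket that BidiBrackets.txt
pairs with the opener on the stack) are assumptions made for
illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only.
final class ClosingBracketMatch {
    // Hard-coded variant, following the shape of the quoted condition.
    static boolean matchesHardCoded(int expectedClosing, int closingcp) {
        return expectedClosing == closingcp
            || (expectedClosing == 0x232A && closingcp == 0x3009)
            || (expectedClosing == 0x3009 && closingcp == 0x232A);
    }

    // Normalization variant: compare the NFD forms of both code points.
    static boolean matchesByNfd(int expectedClosing, int closingcp) {
        return nfd(expectedClosing).equals(nfd(closingcp));
    }

    // NFD of a single code point, returned as a string.
    static String nfd(int cp) {
        return Normalizer.normalize(
            new String(Character.toChars(cp)), Normalizer.Form.NFD);
    }
}
```

For the test line above, the opener U+2329 puts U+232A on the stack as
the expected closing bracket, and both variants accept U+3009 as its
match.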

The reference code is now in a place widely considered a threat to
networks!

Richard.


