Fault in Bidi Algorithm at BD16

Ken Whistler kenwhistler at sonic.net
Sun Mar 20 14:49:19 CDT 2022


On 3/20/2022 10:58 AM, Richard Wordingham via Unicode wrote:
> 2.  Compare the closing paired bracket being inspected or its
>      canonical equivalent to the bracket in the current stack element."
> It was picked up by line 312 of BidiCharacterTests.txt:
> 0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6
> This line primarily checks that U+2329 and U+3009 are identified as a
> 'bracket pair'.  bpb(U+2329) is U+232A, whose canonical decomposition
> is U+3009.  However, the step *numbered* '2' is non-deterministic; it
> contains the word 'or'.

I'm not seeing it. The inclusion of an "or" there does not make this 
non-deterministic.

Yes, the text is not pedantically precise, I suppose, but most people 
have not had trouble interpreting what is intended. If your candidate 
closing bracket (or the canonical equivalent of your candidate closing 
bracket) matches the closing bracket match mapping detailed in 
BidiBrackets.txt for the opening bracket candidate on the stack, then 
you have a bracket match.

This affects precisely 2329 and 232A because those are the *only* 
brackets listed in BidiBrackets.txt that have canonical decomposition 
mappings. And it is vanishingly unlikely that the UTC is ever going to 
add more such paired brackets with canonical decomposition mappings.

> The simple, robust solution is to change 'or
> its canonical equivalent' to 'and its canonical equivalents'.

I don't think that actually would clarify the text. And we shouldn't 
imply more of a requirement to import normalization into UBA than is 
actually needed.

> That
> also avoids the risk of 'its canonical equivalent' being interpreted as
> the result of the function to_NFC or to_NFD.

I don't see the distinction here. The NFC *and* NFD forms of 2329 are 
both 3008. The NFC *and* NFD forms of 232A are both 3009. You could use 
either of those and still end up with the right result for the bracket 
match. But why bother?
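
For comparison, the normalization alternative could be sketched roughly 
as below. This is only an illustration, not code from BidiReference: 
fold_bracket and brackets_match_folded are hypothetical names, and the 
mapping is hard-coded for the two affected code points instead of 
calling a real normalizer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: map the two angle brackets that have singleton
   canonical decompositions onto their decomposed forms. Every other
   code point is returned unchanged. */
static uint32_t fold_bracket(uint32_t cp)
{
    switch (cp) {
    case 0x2329: return 0x3008;
    case 0x232A: return 0x3009;
    default:     return cp;
    }
}

/* expected_close: the paired closing bracket recorded for the opening
   bracket on the stack (an assumed representation).
   closing_cp: the closing bracket currently being inspected. */
static bool brackets_match_folded(uint32_t expected_close, uint32_t closing_cp)
{
    /* Fold both sides so that 232A and 3009 compare equal. */
    return fold_bracket(expected_close) == fold_bracket(closing_cp);
}
```

Both sides are folded because the equivalence runs in either direction: 
the stack entry may hold 232A while the text has 3009, or the reverse.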

The BidiReference code just does a hard-coded additional test (and 
explains why). For this particular edge case, that works just as well, 
is just as robust (see above assertion that UTC isn't going to add more 
exceptions to be dealt with), and would be *faster* than introducing a 
step to normalize the brackets:

         if ( ( bracketData.bracket == closingcp ) ||
              ( ( bracketData.bracket == 0x232A ) && ( closingcp == 0x3009 ) ) ||
              ( ( bracketData.bracket == 0x3009 ) && ( closingcp == 0x232A ) ) )

Note the logical ORs there. If condition_a OR condition_b OR 
condition_c then you have a match. That is completely deterministic in 
this case.
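
Wrapped in a function so that it compiles on its own, the same test 
might look like the sketch below. The function name brackets_match and 
the parameter expected_close are illustrative only; expected_close 
stands in for bracketData.bracket, which is assumed here to hold the 
paired closing bracket for the opening bracket on the stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when closingcp closes the bracket whose expected closing
   code point is expected_close. The two extra clauses cover the only
   bracket pair with canonical decompositions: 232A and 3009. */
static bool brackets_match(uint32_t expected_close, uint32_t closingcp)
{
    return (expected_close == closingcp) ||
           ((expected_close == 0x232A) && (closingcp == 0x3009)) ||
           ((expected_close == 0x3009) && (closingcp == 0x232A));
}
```

For line 312 of BidiCharacterTests.txt, the opening bracket is 2329, so 
expected_close is 0x232A and closingcp is 0x3009. The second clause is 
true and the pair matches.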


> It feels simpler to work with the NFC or NFD equivalents of the
> candidate opening and closing brackets at both the first and last of
> the quoted steps.
