Fault in Bidi Algorithm at BD16
kenwhistler at sonic.net
Sun Mar 20 14:49:19 CDT 2022
On 3/20/2022 10:58 AM, Richard Wordingham via Unicode wrote:
> 2. Compare the closing paired bracket being inspected or its
> canonical equivalent to the bracket in the current stack element."
> It was picked up by line 312 of BidiCharacterTests.txt:
> 0061 0020 2329 0062 002E 0031 3009;1;1;2 2 2 2 2 2 2;0 1 2 3 4 5 6
> This line primarily checks that U+2329 and U+3009 are identified as a
> 'bracket pair'. bpb(U+2329) is U+232A, whose canonical decomposition
> is U+3009. However, the step *numbered* '2' is non-deterministic; it
> contains the word 'or'.
I'm not seeing it. The inclusion of an "or" there does not make this
Yes, the text is not pedantically precise, I suppose, but most people
have not had trouble interpreting what is intended. If your candidate
closing bracket (or the canonical equivalent of your candidate closing
bracket) matches the closing bracket match mapping detailed in
BidiBrackets.txt for the opening bracket candidate on the stack, then
you have a bracket match.
This affects precisely 2329 and 232A because those are the *only*
brackets listed in BidiBrackets.txt that have canonical decomposition
mappings. And it is vanishingly unlikely that the UTC is ever going to
add more such paired brackets with canonical decomposition mappings.
> The simple, robust solution is to change 'or
> its canonical equivalent' to 'and its canonical equivalents'.
I don't think that actually would clarify the text. And we shouldn't
imply more of a requirement to import normalization into UBA than is
> also avoids the risk of 'its canonical equivalent' being interpreted as
> the result of the function to_NFC or to_NFD.
I don't see the distinction here. The NFC *and* NFD forms of 2329 are
both 3008. The NFC *and* NFD forms of 232A are both 3009. You could use
either of those and still end up with the right result for the bracket
match. But why bother?
The BidiReference code just does a hard-coded additional test (and
explains why). For this particular edge case, that works just as well,
is just as robust (see above assertion that UTC isn't going to add more
exceptions to be dealt with), and would be *faster* than introducing a
step to normalize the brackets:
    if ( ( bracketData.bracket == closingcp ) ||
         ( ( bracketData.bracket == 0x232A ) && ( closingcp == 0x3009 ) ) ||
         ( ( bracketData.bracket == 0x3009 ) && ( closingcp == 0x232A ) ) )
Note the logical OR's there. If condition_a OR condition_b OR
condition_c then you have a match. That is completely deterministic in
> It feels simpler to work with the NFC or NFD equivalents of the
> candidate opening and closing brackets at both the first and last of
> the quoted steps.
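For anyone who wants to try the hard-coded comparison in isolation, here is a minimal sketch in C. It is not the BidiReference source: the function name closing_bracket_matches and the parameter expected_close are made up for illustration. It assumes the caller has already looked up the expected closing bracket (the bpb value from BidiBrackets.txt) for the opening bracket on the stack.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * expected_close: the paired closing bracket recorded for the opening
 *                 bracket on the stack (its bpb value).
 * closingcp:      the candidate closing bracket being inspected.
 *
 * Returns true on an exact match, or when the two code points are the
 * canonically equivalent pair U+232A / U+3009, in either order.
 */
static bool closing_bracket_matches(uint32_t expected_close, uint32_t closingcp)
{
    return (expected_close == closingcp) ||
           (expected_close == 0x232A && closingcp == 0x3009) ||
           (expected_close == 0x3009 && closingcp == 0x232A);
}
```

For the test case on line 312, the opening U+2329 has bpb U+232A and the candidate closing bracket is U+3009, so closing_bracket_matches(0x232A, 0x3009) returns true through the second clause without any normalization step.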