UAX #9 (Bidirectional algorithm) reference implementations
Fabian Giesen
fabiang at radgametools.com
Thu Dec 8 20:41:35 CST 2016
I'm currently implementing the bidirectional algorithm and, while
testing my version, ran into some issues with the provided reference
implementations. (http://www.unicode.org/Public/PROGRAMS/)
1. BidiReferenceJava supports Unicode 6.3.0, but has not been updated
for later versions.
In particular, the changes from revision 33 of UAX#9 (corresponding to
Unicode 8.0.0; most notably, limitation of maximum depth for nested
brackets in the PBA, and the rules for handling NSMs following brackets
in rule N0) are missing.
The README of BidiReferenceJava does mention that it implements Unicode
6.3.0 (and it has not been updated since), but this should probably be
made more explicit. Maybe move the current implementation to a "6.3.0"
subdirectory, similar to BidiReferenceC?
---
2. I am reasonably certain I found a bug in BidiReferenceC (version 9.0.0).
Consider these two test cases (in the same format as
BidiCharacterTest.txt):
0061 0028 0062 0029 0300 05D0;1;1;2 2 2 2 2 1;5 0 1 2 3 4
0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5
This concerns runs of NSMs following a paired bracket, and how they
interact with BNs (or, in the right circumstances, other types removed
by Rule X9).
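For readers less familiar with the format: each line has five
semicolon-separated fields (code points, paragraph direction, resolved
paragraph level, resolved levels with "x" for characters removed by
Rule X9, and the visual order of the remaining characters). The
following is only a minimal parsing sketch in Python; the function name
and the dictionary keys are made up for illustration and are not part of
any reference implementation.

```python
def parse_bidi_character_test(line):
    """Split one BidiCharacterTest.txt-style line into its five fields."""
    cps, para_dir, para_level, levels, order = line.strip().split(";")
    return {
        "text": [int(cp, 16) for cp in cps.split()],
        "paragraph_direction": int(para_dir),  # 0 = LTR, 1 = RTL, 2 = auto
        "paragraph_level": int(para_level),
        # "x" marks a character removed by Rule X9; it has no level
        "levels": [None if lv == "x" else int(lv) for lv in levels.split()],
        "visual_order": [int(i) for i in order.split()],
    }


case = parse_bidi_character_test(
    "0061 0028 0062 0029 001B 0300 05D0;1;1;2 2 2 2 x 2 1;6 0 1 2 3 5"
)
```

In the second test case, the escape character at index 4 is the one
with no level, which is why it is also absent from the visual order.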
The first is "a(b)<NSM>A" (A denoting an R-class character) in an RTL
embedding. This test, when run through BidiReferenceC, produces the
expected result.
The key steps are as follows:
1. Classification before the weak types phase is
     L ON L ON NSM R
2. Weak types phase produces
     L ON L ON ON R
3. Rule N0 resolves bracket pair (2,4) to L; the original NSM
   following the closing bracket gets set to L (as per the last
   clause of rule N0) as well.
     L L L L L R
4. Level assignment produces the given expected result
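The last step can be checked by hand with rules I1 and I2. The sketch
below is a simplified Python illustration, assuming a single run at one
embedding level with all types already resolved to L, R, EN or AN; the
function name is hypothetical.

```python
def implicit_levels(types, embedding_level):
    """Apply rules I1/I2 to resolved types that share one embedding level."""
    out = []
    for t in types:
        if embedding_level % 2 == 0:
            # I1: at an even level, R goes up one, AN and EN go up two
            bump = {"R": 1, "AN": 2, "EN": 2}.get(t, 0)
        else:
            # I2: at an odd level, L, EN and AN go up one
            bump = {"L": 1, "EN": 1, "AN": 1}.get(t, 0)
        out.append(embedding_level + bump)
    return out


levels = implicit_levels(["L", "L", "L", "L", "L", "R"], 1)
print(levels)  # [2, 2, 2, 2, 2, 1]
```

With the types from step 3 and embedding level 1, this gives the
expected levels 2 2 2 2 2 1 from the first test case.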
The second test simply adds an ASCII escape character (class BN) before
the NSM. Here, BidiReferenceC produces this result:
Text:       0061 0028 0062 0029 001B 0300 05D0
Bidi_Class:    L    L    L    L   BN    R    R
Levels:        2    2    2    2    x    1    1
Exp Levels:    2    2    2    2    x    2    1
Mismatches:                             ^
Runs: <R------------------------------R>
Order: [6 5 0 1 2 3]
Exp Order: [6 0 1 2 3 5]
which I believe to be incorrect. The only difference from the previous
run is the presence of the BN-type character before the NSM (which
should not matter, since it's supposed to be removed by Rule X9 before
we ever enter the weak types phase).
The problem appears to be around brrule.c:4376, in the function
"br_SetBracketPairBC". The code is written to detect a run of NSMs
following the brackets, but does not skip over deleted characters (which
are denoted by having "level == NOLEVEL").
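To illustrate the behavior I would expect, here is a small Python
sketch of the last clause of rule N0. It is not the code from brrule.c;
the function, its parameters and the "removed" flags (standing in for
"level == NOLEVEL") are assumptions made for this example. The
"skip_removed" switch only exists to contrast a scan that steps over
deleted characters with one that stops at them.

```python
def set_bracket_pair_type(types, orig_types, removed, open_i, close_i,
                          direction, skip_removed=True):
    """Set a bracket pair to `direction`, then propagate that direction to
    characters after the closing bracket whose original type was NSM."""
    types = list(types)  # work on a copy
    types[open_i] = direction
    types[close_i] = direction
    i = close_i + 1
    while i < len(types):
        if removed[i]:
            if skip_removed:
                i += 1      # deleted by X9: act as if it were not there
                continue
            break           # a scan that does not skip stops here
        if orig_types[i] != "NSM":
            break           # end of the NSM run
        types[i] = direction
        i += 1
    return types


# Second test case: a ( b ) ESC NSM A
orig = ["L", "ON", "L", "ON", "BN", "NSM", "R"]
after_weak = ["L", "ON", "L", "ON", "BN", "ON", "R"]
removed = [False, False, False, False, True, False, False]

skipping = set_bracket_pair_type(after_weak, orig, removed, 1, 3, "L")
stopping = set_bracket_pair_type(after_weak, orig, removed, 1, 3, "L",
                                 skip_removed=False)
```

With skipping, the NSM at index 5 becomes L and later gets level 2, as
in the expected result. Without it, the NSM stays ON, is subsequently
resolved to R as a neutral between L and R in an RTL run, and ends up
at level 1, which matches the output shown above.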
Can anyone confirm whether my interpretation of the rules is correct and
this is an actual bug in BidiReferenceC?
Thanks,
-Fabian