Bidi Parenthesis Algorithm and BidiCharacterTest.txt
Eli Zaretskii
eliz at gnu.org
Tue Oct 14 06:48:56 CDT 2014
Hi,
One of the test cases in BidiCharacterTest.txt seems to me to
contradict the description of the rules N0 through N2 of the UBA. Or
maybe I'm missing something.
Here are the details.
The test case in question, on line 114 of BidiCharacterTest.txt, is as
follows:
0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029;1;1;2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1;16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
The first field, up to the first semicolon, is the sequence of
characters given by their Unicode code points, in logical order.
Translated into readable text, it looks like this:
 a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
where I inserted blanks between every 2 characters, for better
readability, and added position numbers. The next field of the test
case data, whose value is 1, specifies that the paragraph direction is
RTL, i.e. the embedding level is 1.
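For anyone who wants to follow along, here is a minimal Python sketch that splits the test line into its fields. The function name and dictionary keys are my own, and the meanings of the fields after the second one (resolved paragraph level, resolved levels, visual order) and of the 'x' marker are my reading of the comments at the top of BidiCharacterTest.txt, so treat them as assumptions.

```python
def parse_bidi_character_test_line(line):
    """Split one data line of BidiCharacterTest.txt into its five fields.

    Hypothetical helper; the field layout is assumed from the file's header.
    """
    fields = [f.strip() for f in line.split(";")]
    code_points = [int(cp, 16) for cp in fields[0].split()]
    return {
        "text": "".join(chr(cp) for cp in code_points),   # logical order
        "paragraph_direction": int(fields[1]),            # 1 means RTL here
        "paragraph_level": int(fields[2]),
        # Assumption: 'x' marks a character that has no resolved level.
        "levels": [None if t == "x" else int(t) for t in fields[3].split()],
        "visual_order": [int(t) for t in fields[4].split()],
    }


LINE = (
    "0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 "
    "005B 005D 005D 05D0 0029;1;1;"
    "2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1;"
    "16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0"
)

case = parse_bidi_character_test_line(LINE)
print(len(case["text"]), case["paragraph_direction"], case["levels"])
```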
Let me now present the application of N0 through N2, as I understand
them, to this text. (Since there are no explicit directional codes
here, and no weak characters, we can skip all the rules before N0.)
The results of identifying bracket pairs, per BD16, sorted by the
position of the opening bracket, are as follows:
2 and 17
3 and 9
7 and 8
11 and 15
13 and 14
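The pairing above can be reproduced with a small stack-based sketch of BD16. This is a simplification under stated assumptions: it knows only the three ASCII bracket pairs that occur in this test case (the real mapping comes from BidiBrackets.txt), it omits the stack-depth limit and canonical-equivalence handling in BD16, and it assumes every bracket still has type ON. The names are mine.

```python
# Only the bracket pairs that appear in this test case (assumption:
# a full implementation would load BidiBrackets.txt instead).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())


def bracket_pairs(text):
    """Return (open, close) positions, 1-based, sorted by opening position."""
    stack = []   # entries: (expected closing bracket, position of opener)
    found = []
    for pos, ch in enumerate(text, start=1):
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    found.append((stack[i][1], pos))
                    # Pop the match and every opener stacked above it.
                    del stack[i:]
                    break
            # No match: the closing bracket is simply ignored.
    return sorted(found)


TEXT = "a(({b\u2680[])}[c[]]\u05d0)"
print(bracket_pairs(TEXT))
# [(2, 17), (3, 9), (7, 8), (11, 15), (13, 14)]
```

Note that '{' at position 4 and '}' at position 10 end up unpaired: the '{' is discarded from the stack when ')' at position 9 matches '(' at position 3.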
Applying N0, we see that:
. The pair 2-17 encloses 'א', which matches the embedding direction,
so N0b says to resolve this pair as matching the embedding
direction, i.e. R.
. The pair 3-9 encloses 'b', whose direction is opposite to the
embedding direction, and has 'a' before the opening bracket, so
N0c1 says we should resolve this pair as L, the direction opposite
to the embedding one.
. The pair 7-8 encloses no strong characters, so it is left as is.
. The pair 11-15 encloses 'c' and is preceded by 'b', so N0c1 again
says to resolve this pair as L.
. The pair 13-14 encloses no strong characters, so it is left alone.
Therefore, the result after N0 is this:
a ( ( { b ⚀ [ ] ) } [ c [ ] ] א )
L R L N L N N N L N L L N N L R R
Applying N1, we then obtain the following result:
a ( ( { b ⚀ [ ] ) } [ c [ ] ] א )
L R L L L L L L L L L L L L L R R
There are no neutrals left, so N2 doesn't need to be applied.
Now I2 gives the following resolved levels:
a ( ( { b ⚀ [ ] ) } [ c [ ] ] א )
2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
However, BidiCharacterTest.txt gives a different sequence of resolved
levels:
2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
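To make the discrepancy concrete, here is a minimal sketch of the implicit-level step (I1 for even embedding levels, I2 for odd ones) applied to the types I obtained after N1, followed by a comparison with the levels from the test file. The function and variable names are my own; the input types are the ones from my analysis above, not something the test file provides.

```python
def resolve_implicit_levels(types, embedding_level):
    """Apply rules I1/I2 to a list of resolved bidi types."""
    levels = []
    for t in types:
        if embedding_level % 2 == 0:
            # I1 (even level): R goes up one level, AN and EN go up two.
            bump = {"R": 1, "AN": 2, "EN": 2}.get(t, 0)
        else:
            # I2 (odd level): L, EN and AN go up one level.
            bump = 1 if t in ("L", "EN", "AN") else 0
        levels.append(embedding_level + bump)
    return levels


# Types after N1, as derived in this message.
my_types = "L R L L L L L L L L L L L L L R R".split()
my_levels = resolve_implicit_levels(my_types, 1)

# Resolved levels as given in BidiCharacterTest.txt.
expected = [int(x) for x in "2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1".split()]

# 1-based positions where the two disagree.
mismatches = [
    pos
    for pos, (mine, theirs) in enumerate(zip(my_levels, expected), start=1)
    if mine != theirs
]
print(my_levels)
print(mismatches)
```

The positions that disagree are exactly those of the brackets and neutrals between positions 3 and 15, i.e. the characters I resolved as L through the pairs 3-9 and 11-15.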
Could someone please point out what I am missing or doing incorrectly?
Thanks in advance.