Bidi Parenthesis Algorithm and BidiCharacterTest.txt

Whistler, Ken ken.whistler at sap.com
Wed Oct 15 13:55:12 CDT 2014


> > I disagree that this makes N0 a "recursive" rule. It is a rule with repeatedly
> > applicable subparts. And like nearly all the rules in the UBA (except ones
> > which explicitly state that they apply to *original* Bidi_Class values,
> > which thus have to be stored across the life of the processing of
> > the string in question), all rules apply to the *current* Bidi_Class
> > values of the examined context.
> 
> Can you point out where this is stated in the UBA?

It isn't explicitly stated, as I think is evident from folks agreeing that
further clarification of the text would be helpful. And I am not an
author/editor of UAX #9 -- the complaints about lack of clarity need
to go to them, preferably through explicit text improvement suggestions
provided as feedback on PRI #279:

http://www.unicode.org/review/pri279/

And I see that Laurentiu Iancu has already helpfully summarized this text issue
and already provided some explicit feedback there -- so I think the upcoming UTC is
covered for that.

I did, however, write a reference implementation for the UBA,
including the bidi bracket pairing for UBA 6.3 -- and reading the
UBA spec to do so (without looking at anybody else's implementation),
I came to the same conclusion regarding the processing of Bidi_Class
values in rule N0 as the other author of a reference implementation
did.

> 
> According to my reading of the UBA, only W7 could qualify as something
> similar to the "recursive" interpretation of N0.  All the other rules
> are either defined in a way that the "recursion" cannot happen
> (because the conditions for applying the rule disappear after it is
> applied once), or explicitly speak about a sequence of similar
> characters whose bidi types are modified in the same manner.
> 
> > Trace: Exiting br_SortPairList
> > Pair list: {1,16} {2,8} {6,7} {10,14} {12,13}
> > Debug: Strong direction e between brackets
> > Debug: Strong direction o between brackets
> > Debug: No strong direction between brackets
> > Debug: Strong direction o between brackets
> > Debug: No strong direction between brackets
> 
> This doesn't explain _why_ the decision was that the direction between
> brackets was one or the other.  Which is at the core of the issue at
> hand.  So this debugging output doesn't really help here.

Well, as author of the code that produced that debug output, I
can agree that the debug output doesn't explain *why* it made the decisions
it did -- I didn't think it was necessary. But I can see that this is
a confusing part of the algorithm, and it would be fairly simple
to further enhance the debugging output the reference implementation provides in future
revisions, so I will endeavor to do so.

In the meantime, however, the source code for that reference
implementation is posted and is easily available. The
relevant part of the rule processing we are talking about
can be found in brrule.c. 

http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/source/brrule.c

The function br_ResolvePairEmbeddingLevels()
(line 4368), reads down the sorted pair list, and processes each
pair in sequence, calling the function br_ResolveOnePair() (line 4235).
And if you examine the code in br_ResolveOnePair, you can see that
it simply searches the string between the bracket pair for a *current*
Bidi_Class value that counts as a strong value, and then, if necessary,
searches back to find a *current* Bidi_Class value to the left of the
left bracket pair that counts as a strong value. And depending on
the results of those searches, it then resolves the Bidi_Class of
the bracket pair itself.

Following that logic, then, it is pretty clear that the behavior of
each successive call to br_ResolveOnePair could, in principle, depend
in checking a Bidi_Class value that had been changed by a prior
call to br_ResolveOnePair.

So yes, multiple, successive calls that depend on the results of the
prior rule subpart.

> 
> In any case, when designing an implementation, one normally expects to
> read some formal requirements, not learn those requirements from
> another implementation.

The UBA has *always* been a difficult algorithm to write a clear
and complete specification for. One of the reasons why several of
us, over the years, have done the work to write *reference* implementations
for it is so that examination of the exact behavior of those coded
implementations for the various odd edge cases of the algorithm can
be referred to in instances (just like this one) where the implications of
a specific attempt to write out the specification of a rule in English
in UAX #9 might still have some lingering ambiguity in it.

And in turn, the experience of the writers of implementations has often
come back around and led to suggestions to improve the wording
of the specification, precisely because questions arose as to what
the correct choices were for edge case behavior.

> 
> Anyway, I'm glad we all agree that, once again, the new additions to
> the UBA, and the BPA-related ones in particular, are not described
> well enough to avoid misinterpretations and misunderstanding such as
> this one, and that the language should be improved and clarified,
> hopefully sooner rather than later.  I've just lost 20 hours of work
> due to that.

I'm sure that the editors of UAX #9 will take this on board, and let's hope
that some clearer wording can be agreed on. 

--Ken




More information about the Unicode mailing list