Unclear text in the UBA (UAX#9) of Unicode 6.3
asmusf at ix.netcom.com
Mon Apr 21 20:08:12 CDT 2014
I appreciate your taking the time to take apart Philippe's message. That
aspect of it was not obvious to me.
PS: more comments below
On 4/21/2014 4:41 PM, Ilya Zakharevich wrote:
> On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:
>> On 4/21/2014 1:54 PM, Philippe Verdy wrote:
>>> My intent was not to demonstrate a bug in the algorithm, I have
>>> not even claimed that, but to make sure that (less common) usages
>>> of paired brackets that do not obey to a pure hierarchy (because
>>> these notations use different type of brackets, they are not
>>> ambiguous) but still preserve their left vs. right (or open vs.
>>> close) semantic.
>> OK, so this has nothing to do with "unclear text".
> Asmus, I cannot agree with this. I think Philippe’s message is on topic.
> [Below, I completely ignore BIDI part of the specification, and
> concentrate ONLY on the parens match. I do not understand why this
> question is interlaced with BIDI determination; I trust that it is.]
It really isn't.
The result of detecting pairs allows one to improve on assigning directionality to the
members of the pair, so that they would match (as expected).
This works only for a (hopefully common) subset of all possible uses.
Like the overall bidi algorithm (UBA), the paired bracket algorithm (PBA) is a
heuristic that frees the author from having to explicitly declare the directionality
for every bit of text, by providing a default directionality that should work
with most text. Exceptional cases then, and ideally only those, would need
overrides and similar markup.
> I suspect Philippe was motivated by a kinda-cowboy attitude which both Eli
> and you show to the problem of “parentheses match” (and I suspect this
> because THAT is my feeling ;-). You give two (IMO, informal) interpretations
> of what the algorithm-based description says. These two interpretations
> are obviously non-compatible (or at least not necessarily clearly stated).
Eli and I both believe that a non-algorithmic definition should be possible, and that
it is preferred to the current algorithmic definition. Not least, because with the
algorithmic definition, it is not possible for anyone, by inspection, to be sure that
they understand what the outcome would be. This is unacceptable, because authors of text
(not only implementers of the PBA) need to be able to predict where the heuristic fails
and the text needs additional markup.
This is not a trivial point - not everybody creates text at an editor where they can
observe the results immediately and take corrective actions. Text is also edited in
environments that do not do bidi processing (e.g. certain kinds of editing) or created
as a result of program action. Knowing when to insert (or not to insert) bidi controls
under program action would benefit from a definition that can be read independently of
the implementation of the PBA.
> As Eli said it: “bracket pair … a concept as easy and widely known/used as
> this would need such an obscure definition ”. Just for background: the first
> theorem on the “Applied Algebra” class taught by Yu.I.Manin was about
> parentheses match (it stated that the proper match is unique as far as it
> exists). This statement is a (tiny) mess to prove, but at least it should
> look very plausible to unwashed masses. (One corollary is that “the
> earliest possible one” from your interpretation is not actually needed.)
( a [ b ) c ] ?
The PBA matches the () but not the [].
Some statement about "earliest" is needed, to select between () and [], even if my
language contains a mistake.
> The problems appear when one wants to allow non-matching parentheses
> as well as matched pairs. [If one fixes Eli’s description so that “a pair”
> and “matched” are complete synonyms, then] what Eli conveys is that
> all non-matching parentheses MUST appear “on top level” only. This is
> workable (meaning the match is still unique).
Eli's definition was:
A bracket pair is a pair of an opening paired bracket and a closing
paired bracket characters within the same isolating run sequence,
such that the Bidi_Paired_Bracket property value of the former
character or its canonical equivalent equals the latter character or
its canonical equivalent, and all the opening and closing bracket
characters in between these two are balanced.
( a [ b ) c ] ?
Under his definition this contains no bracket pair, but the example in UAX#9 says
that the () should form a pair.
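To make that comparison concrete, here is a minimal Python sketch of one reading of the definition quoted above. It is not anyone's reference code. The names PAIRS, balanced and declarative_pairs are made up for illustration, the bracket table is a tiny stand-in for the real property data, "balanced" is assumed to mean "properly nested", and isolating run sequences and canonical equivalence are ignored.

```python
# Hypothetical, deliberately tiny bracket table (stand-in for Bidi_Paired_Bracket).
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def balanced(brackets):
    """True if a sequence of bracket characters is properly nested."""
    stack = []
    for ch in brackets:
        if ch in PAIRS:
            stack.append(PAIRS[ch])          # remember the closer we expect
        elif not stack or stack.pop() != ch:
            return False                     # stray or mismatched closer
    return not stack                         # leftover openers are unbalanced

def declarative_pairs(text):
    """All (open_index, close_index) that satisfy the definition as read here."""
    result = []
    for i, opener in enumerate(text):
        if opener not in PAIRS:
            continue
        for j in range(i + 1, len(text)):
            if text[j] != PAIRS[opener]:
                continue
            # Only the bracket characters strictly between the candidates matter.
            between = [c for c in text[i + 1:j] if c in PAIRS or c in CLOSERS]
            if balanced(between):
                result.append((i, j))
    return result

print(declarative_pairs("( a [ b ) c ]"))   # []
print(declarative_pairs("a ( b ) c"))       # [(2, 6)]
```

Under this reading neither candidate in ( a [ b ) c ] qualifies, because each has a lone unmatched bracket between its two members.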
The purpose of providing my wording was to do precisely the comparison you
have been attempting here, so we end up with language that is an actual
(not merely an attempted) restatement of the algorithmic definition.
> Your approach gives a circular definition: to define which paren chars match
> one must know which ones DO NOT match, and the recursion is not terminated.
> This is exactly what Philippe’s example shows.
Here's the text I supplied, with numbers added for discussion. It definitely needs some
editing, but the point of the exercise would be to see what:
1. A bracket pair is a pair of characters consisting of an opening
   paired bracket and a closing paired bracket such that the
   Bidi_Paired_Bracket property value of the former equals the latter,
   subject to the following constraints.
   a - both characters of a pair occur in the same isolating run sequence
   b - the closing character of a pair follows the opening character
   c - any bracket character can belong at most to one pair, the
       earliest possible one
   d - any bracket character not part of a pair is treated like an
       ordinary character
   e - pairs may nest properly, but their spans may not overlap
2. Bracket characters with canonical decompositions are supposed
   to be treated as if they had been normalized, to allow normalized and
   non-normalized text to give the same result.
c) needs rewording, because it is not correct.
The BD16 examples show
   a ( b ) c ) d     2-4
   a ( b ( c ) d     4-6
From that, it follows that it's not the earliest but the one with the smallest span.
What was intended was to cover the example:
   a ( b [ c ) d ]
This would become (something like)
d) brackets are resolved at the earliest opportunity, starting from the beginning of the text.
f) unpaired bracket characters remaining inside a resolved bracket pair are treated as
ordinary characters (get ignored for bracket matching purposes).
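For anyone who, like me, cannot run the algorithm in their head, the three examples above can be checked with a short Python sketch of the stack procedure in BD16. It is a simplification under stated assumptions and not the normative text. The name bracket_pairs and the three-entry table are invented for illustration, positions are 1-based and skip spaces to match the numbering in the examples, and canonical equivalence, isolating run sequences and any stack-size limit are left out.

```python
# Hypothetical minimal table; the real data comes from BidiBrackets.txt.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def bracket_pairs(text):
    """Return sorted (open_pos, close_pos), 1-based, spaces not counted."""
    stack = []   # entries: (closing bracket we are waiting for, opener position)
    pairs = []
    pos = 0
    for ch in text:
        if ch.isspace():
            continue
        pos += 1
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    del stack[i:]   # pop through the match; openers above it stay unpaired
                    break
            # No match found: the closer is ignored and the stack is left alone.
    return sorted(pairs)

print(bracket_pairs("a ( b ) c ) d"))     # [(2, 4)]
print(bracket_pairs("a ( b ( c ) d"))     # [(4, 6)]
print(bracket_pairs("a ( b [ c ) d ]"))   # [(2, 6)]
```

The del stack[i:] step is what discards the [ in the third example: once the ( ... ) pair is resolved, the opener left inside it can no longer pair with anything.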
Now, I do not see the recursion that you claim.
> My understanding is that Unicode is trying to do is to collect the best
> practical ways to treat multi-Language texts (without knowing fine details
> about the languages actually used in the text). It may be that what is
> “well understood” today IS only the case where non-matched parens appear on
> top-level only.
I don't know about "top level" - brackets improperly nested (and therefore unmatched
within the enclosing bracket pair) simply get ignored for resolving bracket pairs.
> So one may ask: what will be the result of the CURRENT UNICODE parsing applied
> to Phillipe’s example?
> This is an [«] example [»] for demonstration only.
> By Eli’s interpretation, it contains no matched parens. In one reading of
> your interpretation, the external-[] and guillemets would match, and
> internal-][ would be non-matching ones.
Neither. The PBA would match the two pairs of [ ].
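A quick way to check this without trusting the examples alone is a small Python sketch of the BD16 stack search, applied to Philippe's sentence. The assumptions: resolve and BRACKETS are hypothetical names, the table is a minimal stand-in for the real property data, and guillemets are left out of it because they are not paired-bracket characters. The second call adds them as a purely hypothetical pair to see whether that would change anything in this sketch.

```python
# Assumption: minimal bracket table; guillemets are deliberately absent.
BRACKETS = {"(": ")", "[": "]", "{": "}"}

def resolve(text, table=BRACKETS):
    """Return the paired substrings, opening bracket through closing bracket."""
    closers = set(table.values())
    stack, pairs = [], []
    for pos, ch in enumerate(text):
        if ch in table:
            stack.append((table[ch], pos))
        elif ch in closers:
            # Walk down from the top of the stack looking for a matching opener.
            for i in reversed(range(len(stack))):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    del stack[i:]   # discard the match and anything opened after it
                    break
    return [text[o:c + 1] for o, c in sorted(pairs)]

example = "This is an [«] example [»] for demonstration only."
print(resolve(example))                        # ['[«]', '[»]']

# Purely hypothetical: pretend the guillemets were a bracket pair as well.
with_guillemets = {**BRACKETS, "«": "»"}
print(resolve(example, with_guillemets))       # ['[«]', '[»]']
```

In this sketch the outcome is the same either way: the « is discarded when the first ] resolves its [, so the later » finds nothing to pair with.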
Note - in order to make that claim, I'm using the examples from UAX#9, because I cannot
run the algorithm in my head -- so I'm implicitly trusting the examples.
> If one could “show” that in majority of cases that is what the writer’s
> intent was, THEN your interpretation would be “the best
> practical ways to treat multi-Language texts”, and it may be prefered to
> current-algorithmic-description. THIS is why I think the message was on topic.
This part, that is, "whether this is Best", is the one that's off topic.
Eli and I are *exclusively* interested in getting the best possible statement for a
definition that matches that algorithm and can actually be parsed by humans.
Throwing in any discussion of whether the goal of the algorithm is the right one is
confusing the issue and I would prefer if he went and started his own discussion.
> But this is all a very shaky ground…
Why, because you can't follow the algorithm?