Unclear text in the UBA (UAX#9) of Unicode 6.3

Tue Apr 22 04:19:37 CDT 2014

On Mon, Apr 21, 2014 at 11:25:05PM -0700, Asmus Freytag wrote:
> On 4/21/2014 8:32 PM, Ilya Zakharevich wrote:
> >On Mon, Apr 21, 2014 at 06:08:12PM -0700, Asmus Freytag wrote:
> >>Here's the text I supplied, with numbers added for discussion. It
> >>definitely needs some editing, but the point of the exercise would
> >>be to see what:
> >>
> >>     1.  A bracket pair is a pair of characters consisting of an opening
> >>          paired bracket and a closing paired bracket such that the
> >>          Bidi_Paired_Bracket property value of the former equals the latter,
> >>          subject to the following constraints.
> >>
> >>         a - both characters of a pair occur in the same isolating run sequence
> >>         b - the closing character of a pair follows the opening character
> >>         c - any bracket character can belong at most to one pair, the earliest possible one
> >>         d - any bracket character not part of a pair is treated like an ordinary character
> >>         e - pairs may nest properly, but their spans may not overlap otherwise
> >>
> >>
> >>     2.  Bracket characters with canonical decompositions are supposed to be
> >>          treated as if they had been normalized, to allow normalized and
> >>          non-normalized text to give the same result.
> >>
> >>
> >>c) needs rewording, because it is not correct
> >>
> >>The BD16 examples show
> >>
> >>	a ( b ) c ) d		2-4
> >>	a ( b ( c ) d		4-6
> >>
> >> From that, it follows that it's not the earliest but the one with the smallest span.
> >Sorry, I do not see any definition here.  Just a collection of words
> >which looks like a definition, but only locally…
> Thank you for the high praise. :?
> 
> Now you deleted language which I will restore here, put into a
> reasonable order and complete the suggested edit on "c"
> 
> d) brackets are resolved at the earliest opportunity, starting from the beginning of the text.
> 
> c) if there are two possible ways to resolve a pair, the one spanning less text is used.
> 
> f) unpaired bracket characters remaining inside a resolved bracket pair are treated as
> ordinary characters (get ignored for bracket matching purposes).

As I said, to me it is just a combination of words, and I have no
idea how to assign meaning to them.

> >And I think I can even invent an example which I cannot parse using
> >your definition:
> >
> >   1(  2[  3(  4]  5)  6)
> >
> >Is looking-at-1 forcing match of 3-and-5?  Or what?

> Let's see what the text gives (before we improve it further).
> 
> 1. -  1( or 3( could match 5) or 6) , 2[ could only match 4]
> 
> a. - we have only one isolating run, so this is a no-op
> b. - all opening characters follow their putative closing
> characters, so this is a no-op
> d. - at location 5 is the earliest opportunity to match a pair
>      (before we get to 5 we don't have an opening and closing)

Why not match at location 4 then?!

And with
  1(  2[  3(  4]  5)  6)   1a[  2a(  3a[  4a)  5a]  6a]
would you match 2a with 4a on this step?

I think the crucial problem is with

  1(  2[  3(  4]  5) 5b]  6)

I have two possible interpretations: one matches 2 with 5b, another
leaves 2 unmatched.

=======================================================

Anyway, here is a writeup of one of the possible interpretations:

========= Part I

As I said before, for a string consisting of "(" and ")" only, there is a
notion of (Eli’s match):

  depth-match with unmatched "(" and ")" at top-level only.

   In case this is unclear:
    (A) the string is broken into pieces ")", "(", and depth-matched pieces;
    (B) every piece "(" comes after every piece ")".
   Such a match is unique (unless I’m mistaken), and for every matched
   character we know where the other (matching) character of the pair is.
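
A minimal Python sketch of Part I, under an assumption not stated above: that
the usual left-to-right stack scan is what produces this match. A ")" seen
while the stack is empty stays unmatched, and any "(" still on the stack at
the end stays unmatched, so every unmatched ")" precedes every unmatched "(".
The name match_single and its return shape are invented here for illustration.

```python
def match_single(s, open_ch="(", close_ch=")"):
    """Match one delimiter type; all other characters are ignored.

    Returns (pairs, unmatched): pairs maps opener index -> closer index,
    unmatched lists the indices of delimiters left without a partner.
    """
    stack = []          # indices of openers still waiting for a closer
    pairs = {}
    unmatched = []
    for i, ch in enumerate(s):
        if ch == open_ch:
            stack.append(i)
        elif ch == close_ch:
            if stack:
                pairs[stack.pop()] = i   # closes the most recent opener
            else:
                unmatched.append(i)      # ")" with nothing open before it
    unmatched.extend(stack)              # "(" that were never closed
    return pairs, unmatched


print(match_single("a(b)c)d"))
print(match_single("a(b(c)d"))
```

With 0-based indices, the two BD16 strings quoted earlier give the pairs
{1: 3} and {3: 5}, which are the 2-4 and 4-6 of the quoted examples.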

========= Part II

Now allow also secondary parens, "[" and "]", in the string.
 (-1)  Match "(" and ")" as above (ignoring "[" and "]").
  (0)  Match "[" and "]" as above at top level (after removing matched
       pairs "(" and ")" and everything between them).
  (1)  Do the same inside each top-level pair of matching "(" and ")"
       (after removing matched second-level pairs "(" and ")" and
       everything between them).
  (…)  Etc.

========== Part III

Now allow tertiary parens, "{" "}"; etc.

This way, if we have a HIERARCHY of paired Unicode characters, there is
a unique notion of
   depth-match with unmatched delimiters at RELATIVE top-level only.
(Here RELATIVE means: w.r.t. delimiters of higher precedence).
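
One way to read Parts II and III as code, offered as a sketch and not as the
definitive procedure. It assumes that "relative top-level" means: two
delimiters of a lower-precedence type can pair only if they sit inside the
same innermost matched pair of higher precedence (or both at top level), and
that unmatched higher-precedence delimiters are simply ignored. The name
match_hierarchy and the encoding of the hierarchy as two-character strings
are invented here.

```python
def match_hierarchy(s, hierarchy=("()", "[]", "{}")):
    """Match delimiters level by level; return {opener_index: closer_index}.

    hierarchy lists two-character strings "open+close", highest precedence
    first. Characters that are not delimiters are ignored.
    """
    pairs = {}
    # region[k] = opener index of the innermost already-matched pair that
    # strictly encloses position k, or -1 at top level.
    region = [-1] * len(s)
    for open_ch, close_ch in hierarchy:
        stacks = {}       # region id -> pending opener indices in that region
        new_pairs = []
        for i, ch in enumerate(s):
            if ch == open_ch:
                stacks.setdefault(region[i], []).append(i)
            elif ch == close_ch and stacks.get(region[i]):
                new_pairs.append((stacks[region[i]].pop(), i))
        # Record the new pairs, outermost first, so that lower levels see
        # the interior of each pair as a separate region.
        for j, i in sorted(new_pairs):
            outer = region[j]
            for k in range(j + 1, i):
                if region[k] == outer:
                    region[k] = j
            pairs[j] = i
    return pairs


print(match_hierarchy("([(]))"))
print(match_hierarchy("([(]))", ("[]", "()")))
```

On the string 1( 2[ 3( 4] 5) 6) from above (0-based indices 0 to 5), putting
"()" first pairs 0 with 5 and 2 with 4 and leaves both square brackets
unmatched; putting "[]" first pairs 1 with 3 and 0 with 4 instead. So under
this reading the result depends on which hierarchy is chosen.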

=========== Part IV

Now define the hierarchy DYNAMICALLY, separately for every position in
the string. Suppose match-with-what/non-match is already decided; then,
given a position in the string that holds a delimiter:

  (a) write down the delimiters which enclose the position, deeper = later;
  (b) add the delimiter at the position at the end;
  (c) remove duplicates;
  (d) other types of delimiter come later in the hierarchy (in arbitrary order).

This is the dynamically-defined hierarchy at the position.
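
Steps (a) to (d) can be sketched in Python as follows. Several choices here
are assumptions, not part of the text: "enclose" is taken to mean a decided
pair whose opener is before the position and whose closer is after it,
"remove duplicates" keeps the first occurrence of each type, and the
"arbitrary order" of (d) is the order in which the types are listed in
all_types. The name dynamic_hierarchy and the {opener: closer} encoding of a
decision are invented for illustration.

```python
def dynamic_hierarchy(s, decided, pos, all_types=("()", "[]", "{}")):
    """Return the hierarchy (list of "open+close" strings) at position pos.

    decided maps opener index -> closer index for the pairs already fixed;
    s[pos] must be one of the delimiters listed in all_types.
    """
    # Map each delimiter character to its type, e.g. "(" -> "()".
    kind = {ch: t for t in all_types for ch in t}

    # (a) enclosing pairs, sorted by opener; if the pairs nest properly
    #     this lists the outermost first, i.e. deeper = later.
    order = [kind[s[i]] for i, j in sorted(decided.items()) if i < pos < j]
    # (b) the delimiter at the position itself goes at the end.
    order.append(kind[s[pos]])
    # (d) every remaining type follows, in the order given by all_types.
    order.extend(all_types)
    # (c) drop duplicates, keeping the first occurrence of each type.
    return list(dict.fromkeys(order))


print(dynamic_hierarchy("[()]", {0: 3, 1: 2}, 1))
print(dynamic_hierarchy("([(]))", {0: 5, 2: 4}, 3))
```

In the first call the "(" at index 1 sits inside a matched "[" "]" pair, so
"[]" precedes "()" in its hierarchy; in the second call the "]" at index 3 is
enclosed by two "(" ")" pairs, so "()" comes first.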

=========== Part V

The decision of match-with-what/non-match is called OK if every position
with a delimiter is matched/non-matched according to Part III w.r.t. the
hierarchy defined in Part IV.
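
A candidate decision can be tested against this definition mechanically. The
sketch below is only one reading: it restates compact helpers for Part III
(pairing restricted to the same innermost higher-precedence pair) and Part IV
(first occurrence kept when removing duplicates) so that it runs on its own,
and all three function names are invented. It checks a given decision; it
does not search for one and says nothing about the conjecture below.

```python
def match_hierarchy(s, hierarchy):
    """Part III (assumed reading): level-by-level match, {opener: closer}."""
    pairs, region = {}, [-1] * len(s)   # region[k]: innermost enclosing opener
    for open_ch, close_ch in hierarchy:
        stacks, new_pairs = {}, []
        for i, ch in enumerate(s):
            if ch == open_ch:
                stacks.setdefault(region[i], []).append(i)
            elif ch == close_ch and stacks.get(region[i]):
                new_pairs.append((stacks[region[i]].pop(), i))
        for j, i in sorted(new_pairs):  # outermost first
            outer = region[j]
            for k in range(j + 1, i):
                if region[k] == outer:
                    region[k] = j
            pairs[j] = i
    return pairs


def dynamic_hierarchy(s, decided, pos, all_types):
    """Part IV (assumed reading): hierarchy seen from position pos."""
    kind = {ch: t for t in all_types for ch in t}
    order = [kind[s[i]] for i, j in sorted(decided.items()) if i < pos < j]
    order.append(kind[s[pos]])
    order.extend(all_types)
    return list(dict.fromkeys(order))   # first occurrence wins


def is_ok(s, decided, all_types=("()", "[]", "{}")):
    """Part V: does every delimiter agree with Part III under its own hierarchy?"""
    partner = dict(decided)
    partner.update({j: i for i, j in decided.items()})
    for pos, ch in enumerate(s):
        if not any(ch in t for t in all_types):
            continue                    # not a delimiter
        m = match_hierarchy(s, dynamic_hierarchy(s, decided, pos, all_types))
        m.update({j: i for i, j in list(m.items())})
        if m.get(pos) != partner.get(pos):
            return False
    return True


print(is_ok("([(]))", {0: 5, 2: 4}))
print(is_ok("([(]))", {1: 3, 0: 4}))
```

For the string 1( 2[ 3( 4] 5) 6), the decision that pairs 1 with 6 and 3 with
5 (0-based {0: 5, 2: 4}) passes this check, while the decision that pairs 2
with 4 and 1 with 5 (0-based {1: 3, 0: 4}) fails at the first "(".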

Conjecture:
===========

For every string of delimiters, the OK decision of Part V is unique.

Ilya