Unclear text in the UBA (UAX#9) of Unicode 6.3

Eli Zaretskii eliz at gnu.org
Tue Apr 22 11:02:00 CDT 2014


> Date: Mon, 21 Apr 2014 23:25:05 -0700
> From: Asmus Freytag <asmusf at ix.netcom.com>
> Cc: verdy_p at wanadoo.fr, ken at unicode.org, Eli Zaretskii <eliz at gnu.org>,
>         James Clark <jjc at jclark.com>,
>         unicode Unicode Discussion <unicode at unicode.org>
> 
> > And I think I can even invent an example which I cannot parse using
> > your definition:
> >
> >    1(  2[  3(  4]  5)  6)
> >
> > Is looking-at-1 forcing match of 3-and-5?  Or what?
> 
> 
> Let's see what the text gives (before we improve it further).
> 
> 1. -  1( or 3( could match 5) or 6) , 2[ could only match 4]
> 
> a. - we have only one isolating run, so this is a no-op
> b. - all opening characters follow their putative closing characters, so 
> this is a no-op
> d. - at location 5 is the earliest opportunity to match a pair
>       (before we get to 5 we don't have an opening and closing)
> c. - we could match 1( or 3( but we use 3, because it spans less text
> e. , f. - can probably combine these, but 4] is now inside a resolved 
> pair and is ignored.
> 
> Now, when we reach 6) we have another pair, and per d, it's the earliest 
> possible moment
> we can resolve it, so we match 1( and 6).

But that's wrong, isn't it?  If I follow the algorithm in BD16 (which
is really our only reference at this point), I get this:

    input      results
    1(         push 1)
    2[         push 2]
    3(         push 3)
    4]         produce a pair 2[ 4] and pop through and including 2]
    5)         produce 1( 5) and pop the entire stack
    6)         nothing (remains unmatched)

The reference implementation (after I managed to understand how to
invoke it for this case) agrees with me.
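
The stack procedure traced above can be sketched in a few lines of
Python.  This is only a minimal illustration, not the reference
implementation: the name bd16_pairs is hypothetical, only ASCII
brackets are handled, the whole string is assumed to be a single
isolating run sequence, and canonical equivalents and any stack-depth
limit are ignored.  Positions are 0-based, so 1( is index 0 and 5) is
index 4.

```python
# Simplified sketch of the BD16 stack procedure (assumptions: ASCII
# brackets only, one isolating run sequence, no canonical equivalents,
# no stack-depth limit).
PAIRS = {"(": ")", "[": "]", "{": "}"}  # opener -> its paired closing bracket
CLOSERS = set(PAIRS.values())


def bd16_pairs(text):
    """Return (open_index, close_index) pairs sorted by opening position."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for a match.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    del stack[depth:]  # pop through and including the match
                    break
            # No match found: the closing bracket stays unmatched and
            # the stack is left untouched.
    return sorted(pairs)


print(bd16_pairs("([(]))"))  # prints [(0, 4), (1, 3)]
```

The two pairs correspond to 1( 5) and 2[ 4] in the table; 3( and 6)
remain unmatched.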

This once again underlines the problem with the original "definition"
in BD16, which does not lend itself to a useful and yet intuitive
notion of what is "right".

> Eli's definition starts
> 
>    A bracket pair is a pair of an opening paired bracket and a closing
>    paired bracket characters within the same isolating run sequence,
>    such that the Bidi_Paired_Bracket property value of the former
>    character or its canonical equivalent equals the latter character or
>    its canonical equivalent, ....
> 
> and continues:
> 
>    ....and all the opening and closing bracket
>    characters in between these two are balanced.
> 
> That continuation we found out was incorrect, so we would need to fix it.

Indeed.

> Here's an attempt:
> 
>     ... subject to the following conditions:
> 
> 
>     a. a match is attempted at the left-most closing bracket character
>        unmatched at this point
>     b. the closest earlier matching opening bracket that is unmatched
>        at this point is used to form the pair
>     c. any unmatched bracket character enclosed in a pair is ignored
>        for further matching
>     d. matching ends when no more pairs can be formed

I agree, but let me try to say the same more concisely:

   A bracket pair is a pair of an opening paired bracket and a closing
   paired bracket characters within the same isolating run sequence,
   such that the Bidi_Paired_Bracket property value of the former
   character or its canonical equivalent equals the latter character
   or its canonical equivalent, and provided that a closing bracket is
   matched to the closest match candidate, disregarding any candidates
   that either already have a closer match or are enclosed in a
   matched pair of two other bracket characters.
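
One way to check this wording is to apply it literally, without a
stack.  The sketch below is a hedged illustration only: the name
closest_candidate_pairs is hypothetical, closing brackets are assumed
to be considered left to right (as in condition a. of the quoted
list), and the same simplifications apply as elsewhere (ASCII brackets
only, a single isolating run sequence, no canonical equivalents).

```python
# Literal reading of the proposed wording (assumptions: closing brackets
# are taken left to right; ASCII brackets only; one isolating run
# sequence; no canonical equivalents).
PAIRS = {"(": ")", "[": "]", "{": "}"}  # opener -> its paired closing bracket
CLOSERS = set(PAIRS.values())


def closest_candidate_pairs(text):
    """Return (open_index, close_index) pairs sorted by opening position."""
    unavailable = set()  # positions already matched or enclosed in a pair
    pairs = []
    for close_pos, ch in enumerate(text):
        if ch not in CLOSERS:
            continue
        # Walk backwards to the closest candidate that is still usable.
        for open_pos in range(close_pos - 1, -1, -1):
            if open_pos in unavailable:
                continue  # already matched, or enclosed in a matched pair
            if PAIRS.get(text[open_pos]) == ch:
                pairs.append((open_pos, close_pos))
                # The opener is now matched, and everything strictly
                # inside the new pair is disregarded from here on.
                unavailable.update(range(open_pos, close_pos))
                break
    return sorted(pairs)


print(closest_candidate_pairs("([(]))"))  # prints [(0, 4), (1, 3)]
```

On the 1(  2[  3(  4]  5)  6) example this yields 2[ 4] and 1( 5),
with 3( and 6) left unmatched.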


