Unclear text in the UBA (UAX#9) of Unicode 6.3

Asmus Freytag asmusf at ix.netcom.com
Wed Apr 23 11:21:04 CDT 2014

On 4/23/2014 12:35 AM, Ilya Zakharevich wrote:
> On Tue, Apr 22, 2014 at 09:06:27AM -0700, Asmus Freytag wrote:
>> if you read UAX#9, the way the algorithm works is by pushing openers
>> on a stack, then, on finding the first closer, going down the stack
>> and attempting to locate a match, then, on finding a match,
>> discarding any enclosed openers, on not finding a match, discarding
>> the closer.
> I think I LOVE this definition.  Simple, beautiful, and IMO following
> people’s expectations very closely.

I hadn't intended it as a definition, but let's see how it would work as 
one. The "stack" isn't necessary:

    The algorithm works by finding the first closer, looking back
    and attempting to locate a match, then, on finding the nearest match,
    discarding any enclosed openers, otherwise on not finding a match, discarding
    the closer.

That an implementation uses a "stack" to avoid having to "look back" is 
a detail that has no place in a well-crafted definition.

It leaves a few things unstated - that all has to take place inside the 
same isolating run, and, that, on reaching the end, only the matched 
pairs are remembered (all unmatched openers are discarded). (And we need 
to say what "discarded" means).

While it's a statement of an algorithm, it's not obfuscatory, and it's 
relatively easy to consider alternate implementation strategies.
> Here is what “theoretizing” gives:
>   a parsing is good if it satisfies all conditions below:
>     0) Some delimiters in the string are marked as “non-matching”; the rest
>        is broken into disjoint “matched” pairs;
>     MATCH) A “matched” pair consists of an open-delimiter and matching close-
>            delimiter (in this order in the string).
>     NEST) “Matched” pairs are properly nested (meaning that 2 pairs cannot be
>           positioned as Open1 Open2 Close1 Close2 in the string order).
>     MINLEN) “Inside” a “matched” pair, every delimiter which could match elements
>             of the pair but is marked as “non-matching” must nest inside
>             some deeper-nested “matched” pair.
> (I hope that the meaning of the word “inside” in MINLEN is clear.)
>     GREED) Given any close-delimiter marked as “non-matching”, its
>            pre-context does not contain any open-delimiter which could
>            match it.
>       Here pre-context of a position is a concatenation of substrings of the
>       initial string:
>       • Take the most deeply nested “matched pair” containing the position
>         (if none, the whole string);
>       • take the part of the string inside this pair AND before the position;
>       • remove all “matched” pairs completely contained insidde this substring
>         together with what they enclose.

This is a very nice formal definition. I'm surprised that your "GREED"
statement needs such a complex auxiliary concept (pre-context).

Can you explain why, if you make "pre-context" simply the part of the
whole string that precedes the unmatched close-delimiter, the words
"which could match it" are insufficient?

Any opener, that's inside a matched pair entirely in the pre-context
(my def) would be ineligible because of NEST, so you don't have
to "remove" it from the pre-context (your def).

Is there something I'm missing?
> Ilya
> P.S.  Judging by another message of yours, for you “theoretizing” is a
>        4-letter word…  Oh well…
It can be - not in the sense you used it in this post.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140423/00c32b80/attachment.html>

More information about the Unicode mailing list