Unclear text in the UBA (UAX#9) of Unicode 6.3

Ilya Zakharevich nospam-abuse at ilyaz.org
Mon Apr 21 18:41:40 CDT 2014

On Mon, Apr 21, 2014 at 02:44:14PM -0700, Asmus Freytag wrote:
> On 4/21/2014 1:54 PM, Philippe Verdy wrote:
> >My intent was not to demonstrate a bug in the algorithm, I have
> >not even claimed that, but to make sure that (less common) usages
> >of paired brackets that do not obey to a pure hierarchy (because
> >these notations use different type of brackets, they are not
> >ambiguous) but still preserve their left vs. right (or open vs.
> >close) semantic.

> OK, so this has nothing to do with "unclear text".

Asmus, I cannot agree with this.  I think Philippe’s message is on topic.

  [Below, I completely ignore BIDI part of the specification, and
   concentrate ONLY on the parens match.  I do not understand why this
   question is interlaced with BIDI determination; I trust that it is.]

I suspect Philippe was motivated by the kinda-cowboy attitude which both Eli
and you show toward the problem of “parentheses match” (and I suspect this
because THAT is my feeling ;-).  You give two (IMO, informal) interpretations
of what the algorithm-based description says.  These two interpretations
are obviously non-compatible (or at least not necessarily clearly stated).

As Eli said it: “bracket pair … a concept as easy and widely known/used as
this would need such an obscure definition”.  Just for background: the first
theorem in the “Applied Algebra” class taught by Yu.I.Manin was about
parentheses match (it stated that the proper match is unique as far as it
exists).  This statement is a (tiny) mess to prove, but at least it should
look very plausible to unwashed masses.  (One corollary is that “the
earliest possible one” from your interpretation is not actually needed.)

The problems appear when one wants to allow non-matching parentheses as
well as matched pairs.  [If one fixes Eli’s description so that “a pair”
and “matched” are complete synonyms, then] what Eli conveys is that
all non-matching parentheses MUST appear “on top level” only.  This is
workable (meaning the match is still unique).

Your approach gives a circular definition: to define which paren chars match
one must know which ones DO NOT match, and the recursion is not terminated.
This is exactly what Philippe’s example shows.
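
To make the circularity concrete, here is a small Python sketch.  It rests
on an assumed formalization that is not taken from the thread: a “lenient”
matching is any set of same-kind opener/closer pairs that neither cross nor
share a character, with every remaining bracket simply declared
non-matching.  All names (PAIRS, maximal_matchings and so on) are made up
for the illustration, and treating guillemets as a bracket pair just follows
Philippe’s example.  Under these assumptions the bracket skeleton of that
example, [«][»], admits two different maximal matchings, so “ignore the
non-matching ones” does not by itself pick one of them.

```python
from itertools import combinations

# Assumption: guillemets count as a bracket pair, as in Philippe's example.
PAIRS = {"(": ")", "[": "]", "«": "»"}


def candidate_pairs(text):
    """Every (opener index, closer index) pair of the same bracket kind."""
    return [(i, j)
            for i, a in enumerate(text) if a in PAIRS
            for j in range(i + 1, len(text)) if text[j] == PAIRS[a]]


def compatible(p, q):
    """True if two pairs share no character and do not cross."""
    (a, b), (c, d) = sorted([p, q])
    if len({a, b, c, d}) < 4:
        return False          # they share an endpoint
    return b < c or d < b     # disjoint, or the second nested in the first


def consistent(pairs):
    return all(compatible(p, q) for p, q in combinations(pairs, 2))


def maximal_matchings(text):
    """All consistent pair sets to which no further pair can be added."""
    cands = candidate_pairs(text)
    good = [set(c) for r in range(len(cands) + 1)
            for c in combinations(cands, r) if consistent(c)]
    return [sorted(m) for m in good if not any(m < other for other in good)]


print(maximal_matchings("[«][»]"))
# [[(0, 2), (3, 5)], [(0, 5), (1, 4)]]
```

The first result pairs each [ with the nearest ] and leaves both guillemets
unmatched; the second pairs the outer [ ] and the guillemets and leaves the
inner ] [ unmatched.  For a properly nested string such as "([])" the same
function returns a single matching.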


My understanding is that what Unicode is trying to do is to collect the best
practical ways to treat multi-Language texts (without knowing fine details
about the languages actually used in the text).  It may be that what is
“well understood” today IS only the case where non-matched parens appear on
top-level only.

So one may ask: what will be the result of the CURRENT UNICODE parsing applied
to Philippe’s example?

  This is an [«] example [»] for demonstration only.

By Eli’s interpretation, it contains no matched parens.  In one reading of
your interpretation, the external-[] and guillemets would match, and
internal-][ would be non-matching ones.
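
The first of these claims can be checked mechanically.  What follows is a
minimal Python sketch, assuming Eli’s reading is formalized like this: an
opening bracket starts a pair only if a plain stack scan from it reaches its
closer without meeting any mismatched closer, so that non-matching brackets
can only sit outside all pairs.  The function names are hypothetical, and
treating guillemets as a bracket pair again just follows the example; none
of this is the text of the specification.

```python
# Assumption: guillemets count as a bracket pair, as in Philippe's example.
PAIRS = {"(": ")", "[": "]", "{": "}", "«": "»"}
CLOSERS = set(PAIRS.values())


def balanced_span(text, start):
    """Pairs of the balanced span opened at text[start], or None.

    text[start] must be an opening bracket.  The scan fails as soon as a
    closer does not match the innermost open bracket.
    """
    stack, pairs = [], []
    for j in range(start, len(text)):
        ch = text[j]
        if ch in PAIRS:
            stack.append(j)
        elif ch in CLOSERS:
            if PAIRS[text[stack[-1]]] != ch:
                return None                 # mismatch inside the span
            pairs.append((stack.pop(), j))
            if not stack:
                return pairs                # span closed cleanly
    return None                             # ran off the end of the text


def strict_pairs(text):
    """Matched pairs when non-matching brackets are allowed on top level only."""
    result, i = [], 0
    while i < len(text):
        if text[i] in PAIRS:
            span = balanced_span(text, i)
            if span is not None:
                result.extend(span)
                i = span[-1][1] + 1         # continue after the outermost closer
                continue
        i += 1                              # text[i] stays unmatched
    return sorted(result)


print(strict_pairs("This is an [«] example [»] for demonstration only."))
# []
print(strict_pairs("a (b [c] d) e"))
# [(2, 10), (5, 7)]
```

On Philippe’s example the sketch finds no pairs at all, while a properly
nested string gets the expected ones.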

If one could “show” that in the majority of cases that is what the writer’s
intent was, THEN your interpretation would be “the best
practical ways to treat multi-Language texts”, and it may be preferred to the
current algorithmic description.  THIS is why I think the message was on topic.

But this is all a very shaky ground…

