Unclear text in the UBA (UAX#9) of Unicode 6.3
verdy_p at wanadoo.fr
Mon Apr 21 15:54:16 CDT 2014
My intent was not to demonstrate a bug in the algorithm (I have not even
claimed that), but to make sure that less common usages of paired brackets
that do not obey a pure hierarchy (these notations use different
types of brackets, so they are not ambiguous) still preserve their left vs.
right (or open vs. close) semantics.
However, due to the way the algorithm is currently designed, distinct pairs
of brackets still need to be nested hierarchically, and this is not always
And to allow such usages (which do not cause big problems in
unidirectional texts, i.e. texts using characters with the same strong
direction, or characters with neutral and weak directions) in bidirectional
texts, we will necessarily need to use bidi controls. However, these controls
cannot be so strong that they also break the necessary embedding
levels intended for each type of bracket, even when the brackets do not match in
pairs under the algorithm. As they will then be treated in isolation (unpaired
for the hierarchic algorithm), they should still retain their intended RTL or
LTR semantics (and notably their relative placement with things they
surround in non-nested ways, and without being affected as well by
The UBA test cases currently do not cover such uncommon cases, but only
cases with single isolated/unpaired brackets.
I then want to make sure that it will remain possible to write notations
without pure hierarchical nesting (for now they still don't work at all;
the result is already unpredictable, even with bidi controls).
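
To make the limitation concrete, here is a minimal Python sketch of
stack-based pairing in the spirit of BD16. It is only an illustration under
assumptions, not the reference implementation: the names find_bracket_pairs
and PAIRS are hypothetical, the bracket table is a tiny hand-written stand-in
for the real bracket data, the guillemets are treated as a bracket pair purely
for the sake of the example, and bidi classes, canonical equivalents,
isolating run sequences and any stack-depth limit are all ignored.

```python
# Hypothetical, hand-written bracket table (assumption, not the real data).
PAIRS = {"[": "]", "(": ")", "{": "}"}


def find_bracket_pairs(text, pairs=PAIRS):
    """Return sorted (open_index, close_index) pairs found by a stack scan."""
    closers = set(pairs.values())
    stack = []   # entries: (expected closing character, index of the opener)
    found = []
    for i, ch in enumerate(text):
        if ch in pairs:
            stack.append((pairs[ch], i))
        elif ch in closers:
            # Search from the top of the stack for the nearest matching opener.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    found.append((stack[depth][1], i))
                    # Discard the match and every opener stacked above it:
                    # those openers can no longer be paired.
                    del stack[depth:]
                    break
    return sorted(found)


if __name__ == "__main__":
    # Assumption for illustration: guillemets behave like a bracket pair.
    with_guillemets = {**PAIRS, "\u00ab": "\u00bb"}
    print(find_bracket_pairs("a[b(c)d]e"))                        # nested
    print(find_bracket_pairs("a[b(c]d)e"))                        # overlapping
    print(find_bracket_pairs("[\u00ab] x [\u00bb]", with_guillemets))
```

In the overlapping inputs only the square brackets come out paired; the
parentheses and the guillemets are left unpaired, which is the situation I
am worried about.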
Also, I am not concerned only with punctuation pairs but with any kind of textual
pairs, including XML element tags for example, quotation marks
delimiting strings in programming languages, "begin end" keywords in
Pascal or Lua programs, or descriptive expressions in human languages
(e.g. "[start singing] ... [end of song]"), even if they are not concerned
by punctuation mirroring.
You could see these non-nested usages as interlinear or unstructured, but
in fact they do have a structure which should be preserved and not mixed
randomly by an algorithm unable to decipher their meaning, unless there is
some markup or controls saying how to treat these items. We should not even
have to use specific parsers for specific notations (like XML); this is a
more generic abstract problem for texts whose content and semantics are not
nested in a pure hierarchical tree but in subtrees with parallel branches,
and whose rendering will then need to preserve these structures.
My initial message contained a very minimal example of what is needed. I'd
like this sample case to be clearly supported in some way without ambiguity.
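
For concreteness, here is a small Python sketch of one possible way to tag
that sample case, wrapping each "[<<]" and "[>>]" group in a left-to-right
isolate. This is an assumption made for illustration, not a claim about what
the UBA requires or about what a conformant renderer will display; the helper
names isolate_ltr and build_sample are hypothetical, and the Arabic word is
just a stand-in for the translated "example".

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Arabic word used as a stand-in for the translated "example".
ARABIC_EXAMPLE = "\u0645\u062b\u0627\u0644"


def isolate_ltr(s):
    """Wrap s in a left-to-right isolate (hypothetical helper)."""
    return LRI + s + PDI


def build_sample(word=ARABIC_EXAMPLE):
    """Build the sample sentence with each bracket group isolated."""
    opening = isolate_ltr("[\u00ab]")
    closing = isolate_ltr("[\u00bb]")
    return ("This is an " + opening + " " + word + " " + closing
            + " for demonstration only.")


if __name__ == "__main__":
    # List each code point with its bidi class, in logical order.
    for ch in build_sample():
        print("U+%04X %s" % (ord(ch), unicodedata.bidirectional(ch)))
```

The listing only shows the logical order and the bidi classes that an
implementation would start from; whether the result is displayed as intended
is exactly what I would like a test case to establish.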
It will be important for things like songs, poetry, legal texts containing
citations, discussions about another text, threaded discussions, annotating
documents created collaboratively, versioning and showing diffs, and more
exceptionally for interlinear notations (including the inclusion of translator
notes, or notes started on one page and continued elsewhere, possibly on
another page, and containing their own sets of bracket pairs)...
In all these usages, the UBA (and the inferred effect on mirroring) could
cause havoc. And of course I do not want to define a new technical syntax
using references and identifiers, as in XML or JSON, to make these
structures explicit. For the UBA it will be enough if it preserves the intended
direction and mirroring type without having to state explicitly which bracket pairs
with which other one (it should just preserve the start/end or open/close
semantics, leaving the rest to an upper-layer syntax if needed for
more ambiguous cases; a renderer will use any trick it wants to exhibit
this supplementary structure, such as font styles, colors, decorations, or
custom 2D layouts, as provided by a rich text format, which is out of scope
for Unicode and the UBA). Only hierarchical structure is supported in XML or
JSON, but SGML (and HTML) already shows that non-hierarchical structures are
also possible and are effectively used in their supported "content models".
2014-04-21 20:56 GMT+02:00 Asmus Freytag <asmusf at ix.netcom.com>:
> On 4/21/2014 11:23 AM, Philippe Verdy wrote:
> It is on topic because the proposed description attempts to explain how
> paired brackets should match and how this will then affect the rendering
> in bidirectional contexts. This is exactly the kind of thing that is
> difficult because the proposed description assumes that paired brackets are
> organized hierarchically.
> Quote: "both characters of a pair occur in the same isolating run
> sequence" (does not work here: sequences are not fully isolated)
> Quote: "any bracket character can belong at most to one pair, the
> earliest possible one" (does not work here, this is not the earliest
> That's OK, it's a limitation of the algorithm, not the description.
> In other words, the algorithm can help set a better directionality of
> paired (!) brackets, and those are the ones that nest properly.
> What Eli brought to our attention is that the description of this algorithm
> is suboptimal - whether the algorithm could or should be improved is a
> separate matter.
> PS: I think it is unlikely that the UTC will be interested in substantial
> changes to the algorithm, but it should be interested in allowing the
> specification to be less dependent on the sample implementation.
> 2014-04-21 19:48 GMT+02:00 Asmus Freytag <asmusf at ix.netcom.com>:
>> I fail to understand how your post contributes to the topic.
>> The issue was unclear wording of the specification, not deficiencies in
>> the UBA or the PBA in general.
>> Let's keep this discussion limited to issues of wording for the
>> *existing* specification. Feel free to start a new discussion about
>> something else under a new subject.
>> On 4/21/2014 9:18 AM, Philippe Verdy wrote:
>> There are some cases where these rules will not be clear enough. Look at
>> the following where overlaps do occur; but directionality still matters:
>> "This is an [<<] example [>>] for demonstration only."
>> There are two parsings possible if you just consider a hierarchic
>> layout where overlaps are disabled:
>> 1. "This is an [...] for demonstration only.", embedding "<<...>>",
>> itself embedding "] example [" (here the square brackets match externally)
>> 2. "This is an [...] example [...] for demonstration only.", embedding
>> two spans for "<<" and ">>" separately (they also pair externally)
>> Now suppose that the term "example" is translated into Arabic: it is not
>> very clear how the UBA will work while preserving the correct pairing
>> direction of the 3 pairs (one pair is "<<...>>", there are two pairs for
>> "[...]"). Still, all 3 pairs have a coherent direction that Bidi-reordering
>> or glyph mirroring should not mix.
>> I see only one solution to tag such text so that it will behave
>> correctly: either the two pairs of square brackets or the pair of
>> guillemets should be encoded with isolated Bidi overrides. But then what is
>> happening to the ordering of the surrounding text?
>> There should be a stable way to encode this case so that UBA will still
>> work in preserving the correct reading order, and the expected semantics and
>> orientation of pairs and the fact that the guillemets are effectively not
>> really embedding the brackets, but the translated word "example".
>> There are several ways to use Bidi-override or Bidi-embedding controls;
>> I don't know which one is better but all of them should still work with
>> UBA. I just hope that the complex cases of the brackets in the middle
>> ("]...[") can be handled gracefully.
>> In my opinion this would require embedding and isolating each square
>> bracket, so they will no longer match together (externally they are treated as
>> symbols with transparent direction). But how do we ensure that the sequence
>> "[<<]" will still occur before the RTL (Arabic) "example" word, followed by
>> the sequence "[>>]", and that the rest of the sentence ("for demonstration
>> only") will still occur in the correct order? We also have to embed/isolate
>> the "example", or the whole sequence "[<<] example [>>]", so that the main
>> sentence "This is an ... for demonstration only" will still have a coherent
>> reading direction.
>> Such cases are not so exceptional because they occur to represent two
>> distinct parallel readings of the same text, where the reading for one
>> kind of pair simply treats the other pairs as ignored, "transparently".
>> It should be an interesting case to investigate for validating UBA
>> algorithms in a conformance test case.
>> 2014-04-21 16:32 GMT+02:00 Asmus Freytag <asmusf at ix.netcom.com>:
>>> On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
>>> Date: Sun, 20 Apr 2014 23:03:20 -0700
>>> From: Asmus Freytag <asmusf at ix.netcom.com>
>>> CC: Eli Zaretskii <eliz at gnu.org>, unicode at unicode.org,
>>> Kenneth Whistler <ken at unicode.org>
>>> Note that the current embedding level is not changed by this rule.
>>> What does this last sentence mean by "the current embedding level"?
>>> The first bullet of X6 mandates that "the current character's
>>> embedding level" _is_ changed by this rule, so what other "current
>>> embedding level" is alluded to here?
>>> I'm punting on that one - can someone else answer this?
>>> I assume "current embedding level" here meant "the embedding level of
>>> the last entry on the directional status stack". (This is a natural
>>> slip to make if you think in terms of an optimized implementation that
>>> stores each component of the top of the directional status stack in a
>>> variable, as suggested in 3.3.2.)
>>> In general, I heartily dislike "specifications" that just narrate a
>>> particular implementation...
>>> I cannot agree more.
>>> In fact, my main gripe about the UBA additions in 6.3 is that some of
>>> their crucial parts are not formally defined, except by an algorithm
>>> that narrates a specific implementation. The two worst examples of
>>> that are the "definitions" of the isolating run sequence and of the
>>> bracket pair. I didn't ask about those because I managed to figure
>>> them out, but it took many readings of the corresponding parts of the
>>> document. It is IMO a pity that the two main features added in 6.3
>>> are based on definitions that are so hard to penetrate, and which
>>> actually all but force you to use the specific implementation
>>> described by the document.
>>> My working definition that replaces BD13 is this:
>>> An isolating run sequence is the maximal sequence of level runs of
>>> the same embedding level that can be obtained by removing all the
>>> characters between an isolate initiator and its matching PDI (or
>>> paragraph end, if there is no matching PDI) within those level runs.
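
A minimal Python sketch of how the working definition just quoted could be
computed, for readers who prefer to see it run. It rests on assumptions: the
bidi classes and resolved embedding levels of one paragraph are given as
plain lists, characters removed by the explicit-level rules are already
gone, and the function names (matching_pdis, level_runs,
isolating_run_sequences) are hypothetical. It is not taken from the sample
implementation.

```python
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def matching_pdis(types):
    """Map each isolate initiator index to the index of its matching PDI."""
    match, open_initiators = {}, []
    for i, t in enumerate(types):
        if t in ISOLATE_INITIATORS:
            open_initiators.append(i)
        elif t == "PDI" and open_initiators:
            match[open_initiators.pop()] = i
    return match  # unmatched initiators are simply absent


def level_runs(levels):
    """Split indices into maximal runs of equal level: (start, end), end exclusive."""
    runs, start = [], 0
    for i in range(1, len(levels) + 1):
        if i == len(levels) or levels[i] != levels[start]:
            runs.append((start, i))
            start = i
    return runs


def isolating_run_sequences(types, levels):
    """Chain level runs across matched isolate initiator / PDI pairs."""
    pdi_of = matching_pdis(types)
    matched_pdis = set(pdi_of.values())
    runs = level_runs(levels)
    run_containing = {i: run for run in runs for i in range(*run)}
    sequences = []
    for run in runs:
        if run[0] in matched_pdis:
            continue  # this run continues a sequence started earlier
        sequence = [run]
        while True:
            last_index = sequence[-1][1] - 1
            pdi = pdi_of.get(last_index)
            if pdi is None:
                break  # run does not end in a matched isolate initiator
            sequence.append(run_containing[pdi])
        sequences.append(sequence)
    return sequences


if __name__ == "__main__":
    # "a RLI b PDI c" in a left-to-right paragraph (levels assumed given).
    types = ["L", "RLI", "R", "PDI", "L"]
    levels = [0, 0, 1, 0, 0]
    print(isolating_run_sequences(types, levels))
```

In the sample input the two outer level runs end up in one sequence, skipping
the isolated content, which forms a sequence of its own.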
>>> As for bracket pair (BD16), I'm really amazed that a concept as easy
>>> and widely known/used as this would need such an obscure definition
>>> that must have an algorithm as its necessary part. How about this
>>> A bracket pair is a pair of an opening paired bracket and a closing
>>> paired bracket characters within the same isolating run sequence,
>>> such that the Bidi_Paired_Bracket property value of the former
>>> character or its canonical equivalent equals the latter character or
>>> its canonical equivalent, and all the opening and closing bracket
>>> characters in between these two are balanced.
>>> Then we could use the algorithm to explain what it means for brackets
>>> to be balanced (for those readers who somehow don't already know
>>> Again, thanks for clarifying these subtle issues. I can now proceed
>>> to updating the Emacs bidirectional display with the changes in
>>> Unicode 6.3.
>>> FWIW here is the restatement of BD16 that I used for myself (and that I put
>>> into the source comments of the sample Java implementation):
>>> // The following is a restatement of BD 16 using non-algorithmic language.
>>> // A bracket pair is a pair of characters consisting of an opening
>>> // paired bracket and a closing paired bracket such that the
>>> // Bidi_Paired_Bracket property value of the former equals the latter,
>>> // subject to the following constraints.
>>> // - both characters of a pair occur in the same isolating run sequence
>>> // - the closing character of a pair follows the opening character
>>> // - any bracket character can belong at most to one pair, the earliest possible one
>>> // - any bracket character not part of a pair is treated like an ordinary character
>>> // - pairs may nest properly, but their spans may not overlap
>>> // Bracket characters with canonical decompositions are supposed to be treated
>>> // as if they had been normalized, to allow normalized and non-normalized text
>>> // to give the same result.
>>> Your language is more concise, but you may compare for differences.