Unclear text in the UBA (UAX#9) of Unicode 6.3

Asmus Freytag asmusf at ix.netcom.com
Mon Apr 21 09:32:15 CDT 2014


On 4/21/2014 1:33 AM, Eli Zaretskii wrote:
>> Date: Sun, 20 Apr 2014 23:03:20 -0700
>> From: Asmus Freytag <asmusf at ix.netcom.com>
>> CC: Eli Zaretskii <eliz at gnu.org>, unicode at unicode.org,
>>   Kenneth Whistler <ken at unicode.org>
>>
>>>>          Note that the current embedding level is not changed by this rule.
>>>>
>>>>      What does this last sentence mean by "the current embedding level"?
>>>>      The first bullet of X6 mandates that "the current character’s
>>>>      embedding level" _is_ changed by this rule, so what other "current
>>>>      embedding level" is alluded to here?
>>>      I'm punting on that one - can someone else answer this?
>>>
>>>
>>> I assume "current embedding level" here meant "the embedding level of
>>> the last entry on the directional status stack". (This is a natural
>>> slip to make if you think in terms of an optimized implementation that
>>> stores each component of the top of the directional status stack in a
>>> variable, as suggested in 3.3.2.)
>>>
>>> James
>>>
>> In general, I heartily dislike "specifications" that just narrate a
>> particular implementation...
> I cannot agree more.
>
> In fact, my main gripe about the UBA additions in 6.3 is that some of
> their crucial parts are not formally defined, except by an algorithm
> that narrates a specific implementation.  The two worst examples of
> that are the "definitions" of the isolating run sequence and of the
> bracket pair.  I didn't ask about those because I managed to figure
> them out, but it took many readings of the corresponding parts of the
> document.  It is IMO a pity that the two main features added in 6.3
> are based on definitions that are so hard to penetrate, and which
> actually all but force you to use the specific implementation
> described by the document.
>
> My working definition that replaces BD13 is this:
>
>    An isolating run sequence is the maximal sequence of level runs of
>    the same embedding level that can be obtained by removing all the
>    characters between an isolate initiator and its matching PDI (or
>    paragraph end, if there is no matching PDI) within those level runs.
>
> As for bracket pair (BD16), I'm really amazed that a concept as easy
> and widely known/used as this would need such an obscure definition
> that must have an algorithm as its necessary part.  How about this
> instead:
>
>    A bracket pair is a pair of characters, an opening paired bracket and a
>    closing paired bracket, within the same isolating run sequence,
>    such that the Bidi_Paired_Bracket property value of the former
>    character or its canonical equivalent equals the latter character or
>    its canonical equivalent, and all the opening and closing bracket
>    characters in between these two are balanced.
>
> Then we could use the algorithm to explain what it means for brackets
> to be balanced (for those readers who somehow don't already know
> that).
>
> Again, thanks for clarifying these subtle issues.  I can now proceed
> to updating the Emacs bidirectional display with the changes in
> Unicode 6.3.
>
>
FWIW here is the restatement of BD16 that I used for myself (and that I put
into the source comments of the sample Java implementation):

     // The following is a restatement of BD 16 using non-algorithmic language.
     //
     // A bracket pair is a pair of characters consisting of an opening
     // paired bracket and a closing paired bracket such that the
     // Bidi_Paired_Bracket property value of the former equals the latter,
     // subject to the following constraints.
     // - both characters of a pair occur in the same isolating run sequence
     // - the closing character of a pair follows the opening character
     // - any bracket character can belong at most to one pair, the earliest possible one
     // - any bracket character not part of a pair is treated like an ordinary character
     // - pairs may nest properly, but their spans may not overlap otherwise

     // Bracket characters with canonical decompositions are supposed to be treated
     // as if they had been normalized, to allow normalized and non-normalized text
     // to give the same result.

Your language is more concise, but you may compare for differences.
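
To make the comparison concrete, the constraints in the restatement above can be sketched in a few lines of Java. This is only a rough sketch under stated assumptions, not the code of the sample implementation: the class name BracketPairSketch, the method findPairs, and the ASCII-only OPENERS/CLOSERS tables are hypothetical stand-ins for real Bidi_Paired_Bracket data; the input string is treated as a single isolating run sequence; canonical equivalents are not handled; and the depth limit of 63 and the stop-on-overflow behavior are assumed from BD16.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

class BracketPairSketch {
    // Hypothetical stand-in for the Bidi_Paired_Bracket data: ASCII brackets only.
    static final String OPENERS = "([{";
    static final String CLOSERS = ")]}";
    // Assumed stack limit (taken from BD16, not verified here).
    static final int MAX_DEPTH = 63;

    /** Returns {openIndex, closeIndex} pairs, sorted by opening position. */
    static List<int[]> findPairs(String run) {
        // Each stack entry is {expected closing character, position of the opener}.
        Deque<int[]> stack = new ArrayDeque<>();
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            int open = OPENERS.indexOf(c);
            if (open >= 0) {
                if (stack.size() == MAX_DEPTH) {
                    break; // assumption: give up on the rest of the run on overflow
                }
                stack.push(new int[] {CLOSERS.charAt(open), i});
            } else if (CLOSERS.indexOf(c) >= 0) {
                // ArrayDeque iterates from the most recently pushed entry, so this
                // finds the nearest unmatched opener that expects this closer.
                int depth = 0;
                boolean found = false;
                for (int[] entry : stack) {
                    depth++;
                    if (entry[0] == c) {
                        pairs.add(new int[] {entry[1], i});
                        found = true;
                        break;
                    }
                }
                if (found) {
                    // Discard the matched opener and every opener above it; those
                    // can no longer pair without overlapping the new pair.
                    for (int k = 0; k < depth; k++) {
                        stack.pop();
                    }
                }
                // An unmatched closer is simply treated as an ordinary character.
            }
        }
        pairs.sort(Comparator.comparingInt(p -> p[0]));
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : findPairs("a(b[c)d]")) {
            System.out.println(p[0] + "-" + p[1]);
        }
    }
}
```

With the input "a(b[c)d]" the sketch pairs only the parentheses: once "(" and ")" are matched, the "[" between them is dropped from the stack, so the later "]" finds no partner and both square brackets count as ordinary characters. That is the "nest properly, but may not overlap" constraint in action.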

A./