Unclear text in the UBA (UAX#9) of Unicode 6.3
asmusf at ix.netcom.com
Sun Apr 20 14:58:23 CDT 2014
On 4/20/2014 3:24 AM, Eli Zaretskii wrote:
> Would someone please help understand the following subtleties and
> obscure language in the UBA document found at
> http://www.unicode.org/reports/tr9/? Thanks in advance.
I've tried to give you some explanations - in some places, I concur with
you that the wording could be improved and that such improved wording
should be proposed to the UTC (or its editorial committee) for
incorporation into a future update.
For details, see below.
> 1. In paragraph 3.1.2, near its very end, we have this sentence (with
> my emphasis):
> As rule X10 will specify, an isolating run sequence is the unit to
> which the rules following it are applied, and the last character of
> one level run in the sequence is considered to be immediately
> followed by the first character of the next level run in the
> sequence during this phase of the algorithm.
> What does it mean here by "the rules following it"? Following what?
That looks like a bad referent, but from context, this "it" must be X10.
> 2. In BD16 (paragraph 3.1.3), the 1st bullet says:
> . Create a stack for elements each consisting of a bracket character
> and a text position. Initialize it to empty.
> But then 1st sub-bullet of the 3rd bullet says:
> . If an opening paired bracket is found, push its
> Bidi_Paired_Bracket property value and its text position onto
> the stack.
> But the stack does not hold values of Bidi_Paired_Bracket property, it
> holds characters.
The Bidi_Paired_Bracket property is a character code (it is the
character code of the other
partner in the pair).
> Items 2 and 3 below that say:
> 2. Compare the closing paired bracket being inspected or its
> canonical equivalent to the bracket in the current stack
> 3. If the values match, meaning the two characters
> form a bracket pair, then [...]
> So I guess the 1st bullet is correct, but the 3rd bullet should say
> "... push the opening paired bracket character and its text position
> onto the stack". Is this the correct interpretation?
What's really required is that the stack contain a unique identifier for
each bracket pair, so that, given a function that maps either opening or
closing brackets (or their canonical equivalents) to this id, one can
determine that both characters belong to the same pair.
This unique id could be the opening or the closing bracket (or its
canonical equivalent); it makes no practical difference. However, it
looks like UAX#9 is written in terms of the code point for the closing bracket.
Bullet 1 could be changed to
. Create a stack for elements each consisting of a *code point* (Bidi_Paired_Bracket property value)
and a text position. Initialize it to empty.
to make things more clear. And a slight wording change might help the
reader with item 2:
2. Compare the *code point for the* closing paired bracket being inspected or its
canonical equivalent to the *code point* (Bidi_Paired_Bracket property value) in the current stack
And, to continue
3. If the values match, meaning *the character being inspected and the character
at the text position in the stack* form a bracket pair, then [...]
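
To make the "unique identifier" reading concrete, here is a minimal sketch of the BD16 stack in Python. It is only an illustration under stated assumptions: the BPB_OPEN table is a hand-written excerpt standing in for the real Bidi_Paired_Bracket data, the names bracket_pairs and canon are mine, and the check that a bracket still has bidi class ON is left out.

```python
import unicodedata

# Assumption: a tiny hand-written excerpt of the Bidi_Paired_Bracket data,
# mapping an opening bracket to the code point of its closing partner.
BPB_OPEN = {"(": ")", "[": "]", "{": "}", "\u2329": "\u232A", "\u3008": "\u3009"}
CLOSERS = set(BPB_OPEN.values())


def canon(ch):
    """Map a bracket to its canonical equivalent (e.g. U+232A -> U+3009)."""
    return unicodedata.normalize("NFD", ch)


def bracket_pairs(text):
    """Return (open_pos, close_pos) pairs sorted by opening position."""
    stack = []  # elements: (pair id = canonical closing code point, text position)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in BPB_OPEN:
            # Push the Bidi_Paired_Bracket value, not the character itself.
            stack.append((canon(BPB_OPEN[ch]), pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for an entry with the same id.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == canon(ch):
                    pairs.append((stack[i][1], pos))
                    del stack[i:]  # pop the match and everything above it
                    break
            # No match: this closing bracket is ignored.
    return sorted(pairs)
```

With this, bracket_pairs("a(b[c]d)") yields the pairs (1, 7) and (3, 5), and U+2329 followed later by U+3009 is recognized as a pair through the canonical mapping. Storing the opening bracket as the id instead would work just as well, provided the comparison maps the closing bracket back to it.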
> 3. Paragraph 3.3.2 says, under "Non-formatting characters":
> X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI,
> FSI, and PDI:
> . Set the current character’s embedding level to the embedding
> level of the last entry on the directional status stack.
> Note that the current embedding level is not changed by this rule.
> What does this last sentence mean by "the current embedding level"?
> The first bullet of X6 mandates that "the current character’s
> embedding level" _is_ changed by this rule, so what other "current
> embedding level" is alluded to here?
I'm punting on that one - can someone else answer this?
> 4. Rule X10 says in its last bullet:
> Apply rules W1–W7, N0–N2, and I1–I2, in the order in which they
> appear below, to each of the isolating run sequences, applying one
> rule to all the characters in the sequence in the order in which
> they occur in the sequence before applying another rule to any part
> of the sequence. The order that one isolating run sequence is
> treated relative to another does not matter.
> Does the last sentence mean that it is OK to apply W1 to the 1st
> isolating sequence, then apply W1 to the second isolating sequence,
> then apply W2 to the 1st isolating sequence, followed by W2
> application to the 2nd isolating sequence, etc.? IOW, the last
> sentence refers to the order of processing between the isolating run
> sequences, but says nothing about the order of applying rules between
> the sequences.
Apply rules W1–W7, N0–N2, and I1–I2 to each of the isolating run sequences.
For each sequence, [completely] apply each rule in the order in which they appear below.
The order that one isolating run sequence is treated relative to another does not matter.
I believe the above restatement expresses the same thing in fewer words.
The "completely" may be unnecessary. The text about applying the rules to "all
characters" seems to be unnecessary, unless there is, in any of the rules, an
option to not apply it to some characters. Unless incomplete application is
envisaged, calling out the "all characters" here just confuses.
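
As a rough illustration of the ordering question, the sketch below applies three heavily simplified stand-ins for W1, W2 and W3 to two toy sequences, once sequence by sequence and once rule by rule. The function names, the use of "L" in place of sos, and the toy data are all assumptions of mine, not anything taken from UAX#9.

```python
import copy


def w1(seq):
    """Simplified W1: an NSM takes the type of the previous character."""
    for i, t in enumerate(seq):
        if t == "NSM":
            seq[i] = seq[i - 1] if i > 0 else "L"  # "L" stands in for sos


def w2(seq):
    """Simplified W2: EN becomes AN when the last strong type was AL."""
    last_strong = None
    for i, t in enumerate(seq):
        if t in ("L", "R", "AL"):
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            seq[i] = "AN"


def w3(seq):
    """W3: AL becomes R."""
    for i, t in enumerate(seq):
        if t == "AL":
            seq[i] = "R"


def sequence_by_sequence(seqs, rules):
    """Finish all rules on one sequence before touching the next."""
    seqs = copy.deepcopy(seqs)
    for seq in seqs:
        for rule in rules:
            rule(seq)
    return seqs


def rule_by_rule(seqs, rules):
    """Apply one rule to every sequence before moving to the next rule."""
    seqs = copy.deepcopy(seqs)
    for rule in rules:
        for seq in seqs:
            rule(seq)
    return seqs


SEQS = [["AL", "NSM", "EN"], ["L", "EN", "NSM"]]
same = sequence_by_sequence(SEQS, [w1, w2, w3]) == rule_by_rule(SEQS, [w1, w2, w3])
```

Because no rule looks outside its own sequence, both loops give the same result here, so the interleaving asked about appears to be harmless. What does matter is the order of rules within one sequence: running w3 before w2 on the first toy sequence leaves the EN unchanged instead of turning it into AN.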
> 5. Rule N0 says:
> . For each bracket-pair element in the list of pairs of text positions
> a. Inspect the bidirectional types of the characters enclosed
> within the bracket pair.
> b. If any strong type (either L or R) matching the embedding
> direction is found, set the type for both brackets in the pair
> to match the embedding direction.
> First, what is meant here by "strong type [...] matching the embedding
> direction"? Does the "match" here consider only the odd/even value of
> the current embedding level vs R/L type, in the sense that odd levels
> "match" R and even levels "match" L? Or does this mean some other
> kind of matching? Table 3, which is the only place that seems to refer
> to the issue, is not entirely clear, either:
> e The text ordering type (L or R) that matches the embedding level
> direction (even or odd).
> Again, the sense of the "match" here is not clear.
Yes, it is the even/odd match: even levels match L and odd levels match R.
This might be made more explicit.
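
The even/odd reading of "match" can be written down directly. This is a small sketch only; the function names are made up for illustration.

```python
def embedding_direction(level):
    """Return the strong type that matches an embedding level:
    "L" for even levels, "R" for odd levels."""
    return "R" if level % 2 else "L"


def matches_embedding(bidi_type, level):
    """True when a strong type (L or R) matches the embedding direction."""
    return bidi_type == embedding_direction(level)
```

So matches_embedding("R", 1) and matches_embedding("L", 0) hold, while matches_embedding("L", 1) does not.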
> Next, what is meant here by "the characters enclosed within the
> bracket pair"? If the bracket pair encloses another bracket pair,
> which is inner to it, do the characters inside the inner pair count
> for the purposes of resolving the level of the outer pair?
They do, so there's no need to change the text.
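
A minimal sketch of steps a and b of N0 under that reading, where everything between the two brackets counts, including characters inside nested pairs. The names are hypothetical, only L and R are considered as the quoted rule text says, and the remaining steps of N0 are omitted.

```python
def embedding_direction(level):
    """"L" for even embedding levels, "R" for odd ones."""
    return "R" if level % 2 else "L"


def enclosed_has_matching_strong(types, open_pos, close_pos, level):
    """Steps a and b of N0: inspect every type strictly between the brackets.

    The slice covers nested bracket pairs and their contents as well,
    so an inner pair's characters count toward the outer pair.
    """
    e = embedding_direction(level)
    return any(t == e for t in types[open_pos + 1:close_pos])


# Types for a text shaped like "([x])" where x is a strong R character.
TYPES = ["ON", "ON", "R", "ON", "ON"]
outer_matches = enclosed_has_matching_strong(TYPES, 0, 4, level=1)
inner_matches = enclosed_has_matching_strong(TYPES, 1, 3, level=1)
```

At an odd level both the outer and the inner pair see the R, so both would be set to R; at an even level neither check succeeds and the later steps of N0 would have to decide.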
> Lastly, I presume that by "the bidirectional types of the enclosed
> characters" the text means the resolved types as modified by the
> preceding phases, not the original types. Is that correct?
It's the strong type assigned by rule N0.
> Again, thanks in advance for any help.