Bidi reordering of soft hyphen

Whistler, Ken ken.whistler at sap.com
Tue Apr 1 15:20:13 CDT 2014


I don’t think the answer can be deduced directly from UAX #9, because
it involves deciding where to insert a visible hyphen for display.
However, I think the correct answer here is the second of your two
candidate orderings, i.e. (in an RTL paragraph context):

-car SI TORRAC

A way to think about this, rather than starting from the BN nature
of U+00AD, is to ask what would happen if there were an *explicit*
hyphen-minus at the same position. Shortening your example
line “CARROT IS car\u00AD” to just the equivalent of “ABC car-”,
the outcome of the bidiref processing for an RTL paragraph context is:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R    R    L    L    L    R
  Levels:         1    1    1    1    2    2    2    1
  Runs:        <R-----------------------------------R>

  Order:      [7 4 5 6 3 2 1 0]

In other words, on display:

-car CBA
<---------

with the hyphen-minus at the *end* of the reordered line, as
expected.
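
The Order line in that trace can be reproduced from the Levels line alone. What follows is only a minimal sketch of the reversal step (rule L2), not the bidiref code itself; the function name reorder_l2 is made up for illustration, and the levels are copied from the trace rather than computed.

```python
def reorder_l2(levels):
    """Return logical indices in left-to-right display order (UAX #9 rule L2).

    Hypothetical helper: takes already-resolved embedding levels, one per
    character, and performs only the reversal step.
    """
    order = list(range(len(levels)))
    if not levels:
        return order
    highest = max(levels)
    # L2 reverses from the highest level down to the lowest odd level.
    lowest_odd = min((lv for lv in levels if lv % 2), default=highest + 1)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                # Find the end of this contiguous run at `level` or higher.
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order


# Same text and levels as in the trace: alef, bet, gimel, space, "car-".
text = "\u05D0\u05D1\u05D2 car-"
levels = [1, 1, 1, 1, 2, 2, 2, 1]
order = reorder_l2(levels)
visual = "".join(text[i] for i in order)
```

Here order comes out as [7, 4, 5, 6, 3, 2, 1, 0], so the hyphen-minus (logical index 7) is the leftmost glyph, which is the end of the line in an RTL paragraph.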

If you run the same example, but substituting U+00AD for U+002D, you get:

Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 00AD
  Bidi_Class:     R    R    R    R    L    L    L   BN
  Levels:         1    1    1    1    2    2    2    x
  Runs:        <R-----------------------------------R>

  Order:      [4 5 6 3 2 1 0]

And the display for that would be:

car CBA

But *then* your hyphenation algorithm would presumably kick in and decide
that the U+00AD is at the end of the line and should display as a visible
hyphen glyph. But “end of the line” here means the same as it would for
the explicit hyphen-minus, so when you insert the visible hyphen glyph, you
end up with the same result:

-car CBA
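
As a rough sketch of that last step, assuming the renderer holds the reordered line as a list of glyphs in left-to-right display order: the function name and the paragraph_rtl flag below are hypothetical, not part of any particular rendering API.

```python
def add_line_end_hyphen(visual_glyphs, paragraph_rtl, hyphen="-"):
    """Attach a visible hyphen at the end of the line.

    visual_glyphs is assumed to be in left-to-right display order. In an
    RTL paragraph the line ends at the left edge, so the hyphen goes first.
    """
    if paragraph_rtl:
        return [hyphen] + list(visual_glyphs)
    return list(visual_glyphs) + [hyphen]


# "car CBA" is the display order obtained once U+00AD has been removed.
rtl_line = "".join(add_line_end_hyphen("car CBA", paragraph_rtl=True))
ltr_line = "".join(add_line_end_hyphen("car CBA", paragraph_rtl=False))
```

With paragraph_rtl=True this gives "-car CBA", matching the explicit hyphen-minus case.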

Another way of looking at this is that in order to line break your text in
the first place, you need to be able to calculate the resolved display width
to fit in the line. That would have to include the visual display of the inserted
hyphen glyph. So once you have *decided* to break the line at the soft
hyphen, in effect, you substitute a visual display symbol U+002D (or
the actual hyphen U+2010, etc.) for U+00AD. *Then* run the UBA on the
results to get the resolved order of all the elements on the line. The net
effect should be the same.
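
A minimal sketch of that substitution, assuming the line breaker hands over the logical text of one line with the chosen U+00AD as its last character. The helper name substitute_break_hyphen is hypothetical; only the standard unicodedata module is used, and the UBA itself is not run here.

```python
import unicodedata

SOFT_HYPHEN = "\u00AD"


def substitute_break_hyphen(line, visible="\u2010"):
    """Replace a line-final soft hyphen with a visible hyphen character.

    Soft hyphens elsewhere in the line are left alone: they stay BN and
    are ignored by the bidi algorithm.
    """
    if line.endswith(SOFT_HYPHEN):
        return line[:-1] + visible
    return line


logical = "\u05D0\u05D1\u05D2 car" + SOFT_HYPHEN
prepared = substitute_break_hyphen(logical, visible="-")

# Raw Bidi_Class values before any resolution; the trace above shows the
# classes after the weak and neutral rules have been applied.
classes_before = [unicodedata.bidirectional(c) for c in logical]
classes_after = [unicodedata.bidirectional(c) for c in prepared]
```

The last entry changes from BN to ES, so the substituted character now takes part in level resolution instead of being removed by rule X9.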

Maybe folks with full implementations of bidi rendering would have more to
contribute on this, but that would be my own take on the problem.

--Ken



Suppose I have a paragraph (uppercase = RTL):

   CARROT IS car\u00ADrot IN ENGLISH

and the paragraph gets broken at the soft hyphen.

Is the correct ordering for the first line

  car- SI TORRAC

or

  -car SI TORRAC

? I did not succeed in deducing the answer from UAX #9.  Soft hyphen has bidi class BN, which means it gets removed by rule X9, and so, if I have understood correctly, doesn't have a defined embedding level.

I'm guessing the correct ordering is the first one, but I don't trust my instincts here. (In particular, I wondered whether this was analogous to the case where rule L1 resets embedding levels so that trailing whitespace is at the visual end of the line.)

More generally, suppose you have a markup language which has a construct for discretionary breaks, as in TeX, with pre-break, post-break and no-break text. Soft hyphen is a special case of this (where the pre-break text consists of a hyphen, and the post-break and no-break texts are empty); you can also regard space as a kind of discretionary break (post-break text empty, no-break text contains the space, pre-break text either contains the space or is empty, depending on how you want to think about it). Obviously the embedding level for the no-break text should be resolved as if the discretionary break were replaced by the no-break text (which is consistent with a bidi class of BN for soft hyphen). However, for the pre- and post-break text, it is not clear to me what the right way is to resolve embedding levels (or how their content should be restricted so that there is a sensible way to resolve the embedding levels). I would be grateful for any suggestions.

James




