Bidi reordering of soft hyphen

Whistler, Ken ken.whistler at
Tue Apr 1 18:41:48 CDT 2014

> Is it legitimate to truncate the context to a single line?  The BiDi
> algorithm is attempting to interpret unlabelled text as embedded text
> (it's not an arbitrary dance), and in just one line there is no
> indicator of whether the hyphen is part of the LTR text embedded in RTL
> text.

For this discussion, I think yes. See Section 3.4 of UAX #9:

    "The following rules describe the logical process of finding the
    correct display order. As opposed to resolution phases, these rules
    act on a per-line basis and are applied after any line wrapping is
    applied to the paragraph."

The main collection of UBA rules applies on a per-paragraph basis, but
you cannot actually do reordering of the resolved levels until you
have specified the line breaks. Effectively, the hyphenation decision
has to be made first. And *then* you can reorder the results line by line.

So once we have the decision where we are breaking “car-/rot”, we
can then talk just about where the “car-” ends up on the single line.
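To make the per-line step concrete, here is a minimal Python sketch of the reversal part of rule L2 applied to that single line. The function name `reorder_line` is hypothetical, and the resolved levels passed in are an assumption: they were worked by hand from rules N1/N2 and I2 for an RTL paragraph and are not taken from any trace output. The sketch ignores rule L1 and mirroring.

```python
def reorder_line(chars, levels):
    """Reversal step of UAX #9 rule L2 for one line (hypothetical helper).

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters whose level is at or above that level.
    """
    items = list(zip(levels, chars))
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return list(chars)  # nothing is right-to-left, so no reversal
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(items):
            if items[i][0] >= level:
                j = i
                while j < len(items) and items[j][0] >= level:
                    j += 1
                items[i:j] = reversed(items[i:j])
                i = j
            else:
                i += 1
    return [ch for _, ch in items]


# Logical order: alef, bet, gimel, space, "car", hyphen.
logical = ["\u05D0", "\u05D1", "\u05D2", " ", "c", "a", "r", "-"]
# Assumed resolved levels: the Latin letters go to level 2 (I2); the space
# and the trailing hyphen take the embedding direction, level 1 (N2).
levels = [1, 1, 1, 1, 2, 2, 2, 1]

visual = reorder_line(logical, levels)  # left-to-right display order
print(visual)
```

Under those assumed levels the hyphen comes out first in left-to-right display order, ahead of "car", which is the placement under discussion.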

But I agree that there are many conundrums for trying to hyphenate
individual words in mixed-direction bidi text, so I am not surprised
that there would be special typographical conventions which might,
as Asmus suggested, require dropping in LRMs or the like, if you wanted
the visual placement of hyphens to override the basic behavior of the algorithm.

> However, the very next character is 'r', which tells us that the
> left-to-right run contains the hyphen.  I also think the HYPHEN-MINUS
> is the wrong character to consider - the analogy should be with U+2010
> HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let
> alone the ambiguous HYPHEN-MINUS, for which ES is merely the
> interpretation most likely to work.

Well, sure, but for the purposes of *this* particular discussion, it makes
no difference whatsoever whether we are using U+002D or U+2010,
despite the difference in Bidi_Class, since there is no question of numerical
formatting here. Rule W6 will convert the bc=ES to bc=ON, and
thereafter the processing is identical:

Trace: Entering br_UBA_ResolveTerminators [W5]
Current State: 11
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R   WS    L    L    L   ES
  Levels:         1    1    1    1    1    1    1    1
  Runs:        <R-----------------------------------R>

Trace: Entering br_UBA_ResolveESCSET [W6]
Current State: 12
  Text:        05D0 05D1 05D2 0020 0063 0061 0072 002D
  Bidi_Class:     R    R    R   WS    L    L    L   ON
  Levels:         1    1    1    1    1    1    1    1
  Runs:        <R-----------------------------------R>
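
The same point can be checked with a short Python sketch that uses the standard `unicodedata` module. The helper `apply_w6` is hypothetical and covers only rule W6, assuming W4 and W5 have already turned any separators or terminators that belong to numbers into EN/AN (there are no digits on this line, so nothing changes in those steps).

```python
import unicodedata


def apply_w6(classes):
    """Rule W6 only (hypothetical helper): any ES, ET or CS that is still
    present after W4/W5 becomes Other Neutral (ON)."""
    return ["ON" if bc in ("ES", "ET", "CS") else bc for bc in classes]


def bidi_classes(text):
    """Look up the Bidi_Class of each character in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


with_hyphen_minus = bidi_classes("\u05D0\u05D1\u05D2 car\u002D")
with_hyphen = bidi_classes("\u05D0\u05D1\u05D2 car\u2010")

print(with_hyphen_minus)            # last class is ES before W6
print(apply_w6(with_hyphen_minus))  # last class is ON after W6
print(apply_w6(with_hyphen))        # U+2010 is already ON
```

After W6 the two class sequences are the same, so every later rule sees identical input whichever hyphen character was used.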
