Bidi reordering of soft hyphen

Richard Wordingham richard.wordingham at ntlworld.com
Tue Apr 1 18:02:57 CDT 2014


On Tue, 1 Apr 2014 20:20:13 +0000
"Whistler, Ken" <ken.whistler at sap.com> wrote:

> I don’t think the answer is directly deduced from UAX #9, because
> it involves deciding where to insert a visible hyphen for display.
> However, I think the correct answer here is your number two guess,
> i.e. (in a RTL paragraph context):
> 
> -car SI TORRAC
> 
> A way to think about this, rather than starting from the BN nature
> of U+00AD, is to ask what would happen if there was an *explicit*
> hyphen-minus at the same position.

Is it legitimate to truncate the context to a single line?  The BiDi
algorithm is attempting to interpret unlabelled text as embedded text
(it's not an arbitrary dance), and in just one line there is no
indicator of whether the hyphen is part of the LTR text embedded in RTL
text. However, the very next character is 'r', which tells us that the
left-to-right run contains the hyphen.  I also think the HYPHEN-MINUS
is the wrong character to consider - the analogy should be with U+2010
HYPHEN (class ON) rather than with U+2212 MINUS SIGN (class ES), let
alone the ambiguous HYPHEN-MINUS, for which ES is merely the
interpretation most likely to work.
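
The classes mentioned in this message can be checked against the
Unicode Character Database.  A minimal sketch, assuming Python's
standard unicodedata module; the names CANDIDATES and bidi_classes are
illustrative only and not part of any discussion in the thread:

```python
import unicodedata

# Characters discussed in this thread (names as in the Unicode charts).
CANDIDATES = {
    "SOFT HYPHEN": "\u00ad",
    "HYPHEN-MINUS": "\u002d",
    "HYPHEN": "\u2010",
    "MINUS SIGN": "\u2212",
    "HEBREW PUNCTUATION MAQAF": "\u05be",
}


def bidi_classes(chars=CANDIDATES):
    """Map each character name to its Bidi_Class as reported by the UCD."""
    return {name: unicodedata.bidirectional(ch) for name, ch in chars.items()}


if __name__ == "__main__":
    for name, ch in CANDIDATES.items():
        # Prints e.g. "U+00AD SOFT HYPHEN: BN"
        print(f"U+{ord(ch):04X} {name}: {unicodedata.bidirectional(ch)}")
```

This only reports the property values; it says nothing about where a
renderer chooses to draw the visible hyphen.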

I found a similar example, but with Hebrew embedded in the Latin script,
in the introduction to the Stuttgart Bible.  The corresponding character
was U+05BE HEBREW PUNCTUATION MAQAF, though in this case the class is R
(because one doesn't expect MAQAF to be used with left-to-right
scripts), and therefore not as good an example as I would have hoped
for.  The BiDi algorithm then happily places the MAQAF internally,
making the analogy 'car- SI TORRAC'.  (I metaphorically embedded the
quote, so I don't get 'SI TORRAC car-', which is plain wrong.)
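
The two candidate displays differ only in the embedding level given to
the hyphen.  The sketch below applies the reordering step of UAX #9
(rule L2) to hand-assigned levels; it does not resolve levels itself.
The logical string "CARROT IS car-" is my reconstruction from the
displayed forms, and reorder_line and the level lists are hypothetical
names used for illustration:

```python
def reorder_line(text, levels):
    """Reorder one line per UAX #9 rule L2, given a level per character.

    From the highest level down to the lowest odd level, reverse every
    contiguous run of characters whose level is at least that level.
    """
    chars = list(text)
    lv = list(levels)
    odd = [l for l in lv if l % 2 == 1]
    if not odd:
        return text  # purely left-to-right line: nothing to reverse
    for level in range(max(lv), min(odd) - 1, -1):
        i = 0
        while i < len(chars):
            if lv[i] >= level:
                j = i
                while j < len(chars) and lv[j] >= level:
                    j += 1
                # Reverse the run, keeping levels aligned with characters.
                chars[i:j] = chars[i:j][::-1]
                lv[i:j] = lv[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)


# Assumed logical order; upper case stands for right-to-left letters.
LOGICAL = "CARROT IS car-"

# Hyphen treated as part of the embedded LTR run (level 2).
HYPHEN_IN_WORD = [1] * 10 + [2] * 4
# Hyphen treated as belonging to the RTL line itself (level 1).
HYPHEN_ON_LINE = [1] * 10 + [2] * 3 + [1]

if __name__ == "__main__":
    print(reorder_line(LOGICAL, HYPHEN_IN_WORD))  # car- SI TORRAC
    print(reorder_line(LOGICAL, HYPHEN_ON_LINE))  # -car SI TORRAC
```

So the disagreement is not about the reordering step, only about which
level the inserted hyphen should inherit.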

Now, a valid opposing view is that the graphical representation of soft
hyphens says, "When written out as one very long line, there is no
space between successive lines", as opposed to "This apparent word is
actually continued by text on the next line".  If you take the marks
as operating at the level of lines, then '-car SI TORRAC' is
reasonable.  As English has the hyphen as a half-way house between one
word and two words, English very naturally works at the word level.  I
am not sure about other languages.

Richard.



