Hyphenation Markup

Sat Jun 2 22:31:32 CDT 2018

On Sat, 2 Jun 2018 14:33:01 -0600
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:
> 
> >> What about U+200B ZWSP?  
> >
> > Thanks for the suggestion, but it's not likely to work:  
> 
> Are you asking what schemes exist, or are you trying to call
> attention to some rendering engine and/or font that doesn't render a
> combination as it should?

I'm asking what exists, or is reasonably supposed to exist. 

> This is too general for me to parse. Can you replace these
> hypotheticals with actual characters, using code points, or at least
> with actual General Categories? For example, an 'Mc' followed by ZWSP
> followed by an 'Lo' displays like such-and-so. The code points would
> be best.

On Sun, 3 Jun 2018 09:26:40 +0900
"Martin J. Dürst via Unicode" <unicode at unicode.org> wrote:
> My question goes a bit further than to Doug's: Why would you want to
> do such a thing? Are there actual scripts/languages where line breaks 
> within grapheme clusters occur? If yes, what are they? Can you show 
> actual examples, e.g. scans of documents,...?

Three examples are given on p. 230 of the dissertation "Buddhist Monks and
their Search for Knowledge: an examination of the personal collection of
manuscripts of Phra Khamchan Virachitto (1920-2007), Abbot of Vat Saen
Sukharam, Luang Prabang" by Bounleuth Sengsoulin, available at
http://ediss.sub.uni-hamburg.de/volltexte/2016/8039/pdf/Dissertation.pdf .
The manuscript text is in the Lao language, written in the Tham script.
The transcriptions in the dissertation are transliterated to the Lao script.

The first example, transliterated to Lao, is ເມຽ,  which one could
encode as <U+0EC0 LAO VOWEL SIGN E, U+00AD SOFT HYPHEN, U+0EA1 LAO
LETTER MO, U+0EBD LAO SEMIVOWEL SIGN NYO>, provided the soft hyphen had
no visual representation beyond the line break.  (Strictly, it's a
break for a hole for a string.)  The third example is likewise ໄຫວ
<U+0EC4 LAO VOWEL SIGN AI, U+00AD SOFT HYPHEN, U+0EAB LAO LETTER HO
SUNG, U+0EA7 LAO LETTER WO>. (I can't make out the second example.)
However, the text is actually in the Tham script, and without any
line-breaking controls, the first and third examples read, marking the
grapheme cluster boundaries with '|', as ᨾ᩠ᨿᩮ <U+1A3E TAI THAM LETTER
MA, U+1A60 TAI THAM SIGN SAKOT | U+1A3F TAI THAM LETTER LOW YA, U+1A6E
TAI THAM VOWEL SIGN E> and ᩉ᩠ᩅᩱ <U+1A49 TAI THAM LETTER HIGH HA, U+1A60
TAI THAM SIGN SAKOT | U+1A45 TAI THAM LETTER WA, U+1A71 TAI THAM VOWEL
SIGN AI>.  The internal grapheme cluster boundaries are purely stopping
points for cursor movement; they correspond to nothing graphical
and to nothing in user conception.  The natural internal boundaries are
just before the vowels, which are written on the left, and between the
base and subscript characters, i.e. before U+1A60.
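
To make the first example concrete, here is a minimal Python sketch. It
only lists the code points of the sequences given above and reports the
offsets of the "natural" boundaries in the Tham sequence.  The helper
names (describe, visible, natural_boundaries) and the LEFT_VOWELS set
are my own assumptions for illustration; the set contains only the two
left-written vowels that occur in these examples, not every preposed
Tai Tham vowel.

```python
import unicodedata

SOFT_HYPHEN = "\u00AD"
SAKOT = "\u1A60"

# Sequences as given in the message.
LAO_1 = "\u0EC0\u00AD\u0EA1\u0EBD"   # first example, Lao, with soft hyphen
LAO_3 = "\u0EC4\u00AD\u0EAB\u0EA7"   # third example, Lao, with soft hyphen
THAM_1 = "\u1A3E\u1A60\u1A3F\u1A6E"  # first example, Tham, no break controls

# Assumption: only the two left-written vowels seen in these examples.
LEFT_VOWELS = {"\u1A6E", "\u1A71"}


def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def visible(text):
    """Drop soft hyphens, leaving what is shown when no break is taken."""
    return text.replace(SOFT_HYPHEN, "")


def natural_boundaries(text):
    """Offsets just before SAKOT or a left-written vowel (hypothetical rule)."""
    return [
        i
        for i, ch in enumerate(text)
        if i > 0 and (ch == SAKOT or ch in LEFT_VOWELS)
    ]


if __name__ == "__main__":
    for label, seq in (("LAO_1", LAO_1), ("LAO_3", LAO_3), ("THAM_1", THAM_1)):
        print(label)
        for line in describe(seq):
            print("   ", line)
    print("Natural boundary offsets in THAM_1:", natural_boundaries(THAM_1))
```

The offsets are positions in the encoded (logical) sequence, not in the
visual order, which is exactly why a break control placed there has to
cope with the vowel being displayed to the left of the consonant stack.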

There seem to be Northern Thai Pali examples in the proposal L2/07-007
( https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf ), at the end
of Figure 9a Page 2 Line 3 and at the end of Figure 9b Page 1 Line 2,
but I can't read the Pali well enough to be sure that the apparent
visually line-final instances of TAI THAM VOWEL SIGN E are not just
scribal blunders.

Reverting to Doug's reply:
> > Incidentally, does CLDR define the rendering of soft hyphen, or is
> > one entirely at the mercy of the application?  

> Why would this be a CLDR thing?

Because the rendering is quite likely to depend on locale.  I had
always understood that Thai did not mark breaks in words - and then I
discovered them in the Royal Institute Dictionary!  The correct German
rendering of soft hyphens has recently changed.  There are also subtle
effects when Dutch words are hyphenated.  These rules are not the same
as for English, but Unicode tends not to deal in dependencies finer
than a script.

Richard.