Displaying Lines of Text as Line-Broken by a Human

Richard Wordingham via Unicode unicode at unicode.org
Mon Jul 22 11:18:57 CDT 2019

On Sun, 21 Jul 2019 20:53:19 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> There's really no inherent need for many spacing combining marks to
> have a base character. At least the ones that do not reorder and that
> don't overhang the base character's glyph.

We are in agreement here.

> As far as I can  tell, it's largely a convention that originally
> helped identify clusters and other lack of break opportunities. But
> now that we have separate properties for segmentation, it's not
> strictly necessary to overload the combining property for that
> purpose.

Which relates to the separate question I asked about breaking at
grapheme boundaries.  Interestingly, I'm not seeing breaks next to an
invisible stacker, but that may be because Pali subscript consonants
only slightly increase the width of the cluster.

The need for a base makes sense for reordering spacing marks, but its
purpose should be to detect editing errors, not to block deliberate
effects.  An unreordered reordering mark plus consonant is visually
ambiguous with consonant plus reordering mark.

> In your example, why do you need the ZWJ and dotted circle?

The user- and application-supplied text would be
<NBSP, ZWJ, spacing_mark>.
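
For concreteness, here is a minimal Python sketch of that sequence, taking
U+1A6E TAI THAM VOWEL SIGN E (discussed below) as the spacing mark.  The
helper name describe() is made up for illustration; the sketch only lists
what is in the backing store and says nothing about how any renderer
will treat it.

```python
import unicodedata

NBSP = "\u00A0"    # NO-BREAK SPACE, the carrier for the isolated mark
ZWJ = "\u200D"     # ZERO WIDTH JOINER, the hint discussed in this thread
SIGN_E = "\u1A6E"  # TAI THAM VOWEL SIGN E, a spacing mark (gc=Mc)

def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

# The user- and application-supplied text for a mark shown on its own.
isolated_mark = NBSP + ZWJ + SIGN_E

for row in describe(isolated_mark):
    print(*row)
```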

> Originally, just applying a combining mark to a NBSP should normally
> show the mark by itself. If a font insists on inserting a dotted
> circle glyph, that's not required from a conformance perspective -
> just something that's seen as helpful (to most users).

It's not the font that inserts the dotted circle, it's the rendering
engine.  That's why the USE (Universal Shaping Engine) set Tai Tham
rendering back several
years.  Now, there is at least one renderer (HarfBuzz) for which a
cunning font can work out whether the renderer has introduced the
dotted circle glyph rather than it being in the text to be rendered.  I
am looking for a general font-level solution to the problem that would
even work on Windows 10.

The ZWJ seems a reasonable hint that the space should be rendered with
zero width.  Do you think it is reasonable for <NBSP, spacing_mark> to
have zero width contribution from the NBSP when the spacing mark has a
non-overhanging glyph? It seems to be an unstandardised area, but zero
width might be considered to violate the character identity of NBSP.

I also have the problem of visually line-final U+1A6E TAI THAM VOWEL
SIGN E, which needs to be separated from a preceding consonant in the
backing store.  It seems to be particularly common before the holes
(two per page) for the string that holds the pages together.  Perhaps
the scribe tried to avoid line-final U+1A6E.

There are examples of these issues in Figure 9b of
http://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf .  The last
syllable of _cattāro_ 'four' straddles lines 2 and 3, with its first
glyph (corresponding to SIGN E) ending line 2, and <RA, SIGN AA>
starting line 3.

The antepenultimate syllable of _sammodamānehi_ (misspelt
_samoddamānehi_) 'pleasing' is split between lines 7 and 8, with line 7
ending in MA and line 8 starting in SIGN AA.

I am looking for advice on what is the least bad readily achievable
solution. I can then adapt that to cope with the messier issue of the
non-spacing character U+1A58 TAI THAM SIGN MAI KANG LAI, which acts
like Burmese kinzi in the Pali text I am working on.  (If one does not
know the font well, one should not put a line break next to it unless
all other options are exhausted.)  Figure 9b also has an example of this
issue.  The initial consonant of _saṅkhepaṃ_ (misspelt _saṅkheppaṃ_)
'collection, summary' is on line 9, while the rest of the word,
starting <MAI KANG LAI, HIGH KHA, SIGN E>, is on line 10. 
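
The spacing/non-spacing distinction between the two marks can be read
off the General_Category property.  A small Python sketch follows; the
function name mark_kind() is made up for illustration, and the answers
depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

def mark_kind(ch):
    """Classify a character by its General_Category (Mc, Mn or Me)."""
    kinds = {"Mc": "spacing", "Mn": "non-spacing", "Me": "enclosing"}
    return kinds.get(unicodedata.category(ch), "not a mark")

for ch in ("\u1A6E", "\u1A58"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", mark_kind(ch))
```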

There is a weird hack that currently helps with LibreOffice - inserting
CGJ turns off some parts of Indic shaping in the rest of the run, which
helps with visually line-final SIGN E.  Or have I missed some new
specification of Indic encoding?

