Ambiguous hyphenation cases with

Christoph Päper christoph.paeper at
Tue Jul 22 11:14:11 CDT 2014

fantasai <fantasai.lists at>:

>> The problem is that the hyphenation system in itself can't decide how
>> to change the spelling, without any "dictionary" functionality. It
>> can't know if I meant "mat-tjuv" ("food thief" in Swedish) or "matt-tjuv"
>> ("carpet thief") when I wrote "mat­tjuv". So there has to be a way
>> to tell the hyphenation system that.

Imagine if there was also ‘matt·juv’ next to ‘mat·tjuv’ and ‘matt·tjuv’, or even ‘mat·ttjuv’.

> Hm. I don't think I have a solution for that problem. :/
> Currently you'd just have to not hyphenate that word.

Smart-font solution (OpenType, AFDKO syntax):

  “mattjuv, matttjuv”

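  lookup tripleletters {
    # Show only two of three consecutive t's: "ttt" -> "tt".
    sub t' t' t by t;
  } tripleletters;

  feature rlig {
    script latn;
    # OpenType language tags are not ISO 639 codes; Swedish is SVE.
    language SVE exclude_dflt;
    lookup tripleletters;
  } rlig;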

Combining Grapheme Joiner (U+034F, ‘CGJ’) could possibly be given an interpretation like this (XML syntax), but Zero-Width Non-Joiner (U+200C, ‘ZWNJ’) should probably not be repurposed:

  “mattjuv, mat&#x34F;tjuv”
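
One possible reading of such a CGJ interpretation, sketched in Python. Nothing here is specified anywhere: the function name render_cgj is hypothetical, and the rule "repeat the letter after the marker in front of the hyphen" is an assumption chosen so that the example word comes out right.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def render_cgj(word: str, broken: bool = False) -> list[str]:
    """Display a word in which CGJ marks a collapsed double letter.

    Assumption: unbroken, the marker is simply invisible; when the line
    breaks at the marker, the letter following it is repeated before the
    hyphen.
    """
    head, marker, tail = word.partition(CGJ)
    if not marker:
        return [word]
    if not broken:
        return [head + tail]
    return [head + tail[:1] + "-", tail]


print(render_cgj("mat\u034Ftjuv"))               # ['mattjuv']
print(render_cgj("mat\u034Ftjuv", broken=True))  # ['matt-', 'tjuv']
```

Software that ignores the marker would still show ‘mattjuv’, because the extra letter is never stored in the text.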

Possible Unicode solution with a new combining character that makes the preceding character or grapheme – I’m not sure which – invisible except at the end of a line:

  “mattjuv, matt&#x2065;tjuv”

  U+2065 – Combining Collapse or Reduplicating Soft Hyphen or so
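
A minimal Python sketch of how such a character might behave, assuming it hides exactly one preceding character (not a whole grapheme cluster) and that the break falls right at the marker. U+2065 is used only as a stand-in code point, and render_collapse is a hypothetical name.

```python
COLLAPSE = "\u2065"  # stand-in code point for the proposed character


def render_collapse(word: str, broken: bool = False) -> list[str]:
    """Display a word in which the marker hides the character before it,
    except when that character ends up at the end of a line."""
    head, marker, tail = word.partition(COLLAPSE)
    if not marker:
        return [word]
    if broken:
        # At a line end the preceding character stays visible.
        return [head + "-", tail]
    # Mid-line the preceding character is suppressed.
    return [head[:-1] + tail]


print(render_collapse("matt\u2065tjuv"))               # ['mattjuv']
print(render_collapse("matt\u2065tjuv", broken=True))  # ['matt-', 'tjuv']
```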

All solutions require author education. The latter two require changes to existing software and specifications (including CSS); the first “just” needs updated fonts. The second solution would fall back gracefully to ‘mattjuv’, the others to ‘matttjuv’, maybe even with a .notdef glyph in there.

All of these approaches are too complicated for Joe Sixpack (or Jo Sexpack), so I don’t think any of them will work in practice, except in environments that already take care of edge cases such as telling apart the umlaut and diaeresis uses of the trema dots.

JFTR, Swedish is not the only language with this orthographic feature. The German orthography reform of 1996 did away with letter collapsing completely, probably because of this very problem. Now the same letter can occur three times in a row on one line, which some consider ugly, but smart fonts can overcome most of the perceived problems by ligating the first two letters of such a sequence or by selecting an alternate glyph for the final one. The special treatment of the double-‘k’ grapheme was also abolished: it used to look like ‘ck’ – often a ligature – except at the end of the line, where it showed its real face, ‘k-k’; now it’s always typed, encoded and displayed as ‘ck’ and cannot be separated. The theoretical graphemes ‘zz’ and ‘hh’ still look like ‘tz’ and ‘ch’ respectively, of which only the former may be split, as ‘t-z’.
