Regex for the USE to Handle Tai Tham

Richard Wordingham richard.wordingham at
Sat Feb 4 15:54:11 CST 2017

I'm not sure if this is the right forum for the question; if not,
please advise me where I should take the problem for public discussion.

The immediate problem is that the Universal Shaping Engine (USE) uses a
regular expression for Indic orthographic syllables that doesn't cover
the common CVC orthographic syllables of the Tai Tham script, let alone
the rarer CVCVC orthographic syllables.

In his paper earlier this year, 'Making fonts for the Universal Shaping
Engine' (available at, John Hudson
reported, "It’s called the Universal Shaping Engine, then, not because
it shapes all scripts, but because it uses a universal model.  Of
course, as soon as you declare that you have a universal model, someone
comes along with an exception to that model. In this case, the
exception is the Tai Tham or Lanna script of northern Thailand, which
uses subjoined consonants in ways that may compress multiple syllables
into a single cluster, causing recursion in cluster analysis. It
remains to be seen whether Tai Tham can be accommodated with exception
code in the Universal Shaping Engine, or will need to be passed to a
script-specific engine."

Does anyone know what problem lies behind the complaint that Tai Tham
causes "recursion in cluster analysis"?  For syllables
without a dangling stacking control code, the regular expression is
similar (see
for the precise form) to

base subscript* vowel* final*


subscript = medial | consonant_subjoined | subjoiner consonant

subjoiner = virama | coeng

final = final_consonant

I have omitted various modifiers for clarity.
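
To make the simplified expression concrete, here is a minimal Python
sketch.  It is not the USE implementation.  The one-letter class codes
are my own assumption for illustration, and I assume the consonant that
follows a subjoiner belongs to the same class as a base.

```python
import re

# Hypothetical one-letter codes for the character classes (not USE names):
#   B = base (also used for the consonant after a subjoiner)
#   M = medial, S = consonant_subjoined, H = subjoiner (virama or coeng)
#   V = vowel, F = final_consonant
SUBSCRIPT = r"(?:M|S|HB)"

# base subscript* vowel* final*
CURRENT = re.compile(rf"B{SUBSCRIPT}*V*F*")

def is_syllable(classes: str) -> bool:
    """Return True if the whole class string is one orthographic syllable."""
    return CURRENT.fullmatch(classes) is not None

print(is_syllable("BHBVF"))  # True: base, subjoined consonant, vowel, final
print(is_syllable("BVHB"))   # False: subjoined consonant after the vowel
```

The second call is the CVC shape described above: once a vowel has been
seen, the simplified expression has no place for a further subjoiner
plus consonant.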

Now, the obvious generalisation to cover the Tai Tham script (and,
incidentally, the Khmer script) is

base (subscript* vowel* final2*)*


final2 = final | subjoiner consonant

I see iteration here, but the original expression already had
iteration, so I don't know what the problematic 'recursion' is.
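
As a rough check of the generalised expression, here is a small Python
sketch.  The one-letter class codes are hypothetical, chosen only for
this example, and the modifiers are still omitted.

```python
import re

# Hypothetical one-letter codes for the character classes (not USE names):
#   B = base (also used for the consonant after a subjoiner)
#   M = medial, S = consonant_subjoined, H = subjoiner (virama or coeng)
#   V = vowel, F = final_consonant
SUBSCRIPT = r"(?:M|S|HB)"
FINAL2 = r"(?:F|HB)"          # final2 = final | subjoiner consonant

# base (subscript* vowel* final2*)*
GENERAL = re.compile(rf"B(?:{SUBSCRIPT}*V*{FINAL2}*)*")

def is_syllable2(classes: str) -> bool:
    """Return True if the whole class string matches the generalised form."""
    return GENERAL.fullmatch(classes) is not None

print(is_syllable2("BVHB"))     # True: CVC with a subjoined final consonant
print(is_syllable2("BVHBVHB"))  # True: CVCVC
print(is_syllable2("BH"))       # False: dangling subjoiner
```

Only iteration is used: the outer group simply repeats, and nothing in
the pattern refers back to itself.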

I can make various guesses.  Perhaps the regex needs to be
'unambiguous'.  Perhaps it needs to be 'deterministic', i.e. each
character can be matched to an element of the regex as soon as
encountered.  Perhaps the problem is just that the regex encourages
backtracking.  These possible issues all seem soluble, so please,
someone, what is the problem?
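
To illustrate why I think the determinism and backtracking concerns are
soluble, here is a sketch for the simplified form only.  It relies on
the identity (a* b* c*)* = (a | b | c)*, so the nested expression
accepts the same strings as a flat alternation, which can be scanned
left to right with one class of lookahead after a subjoiner.  The class
codes and function names are hypothetical, and the flat form no longer
records whether a subjoined consonant was a subscript or a final, which
a real shaping engine may need to know.

```python
import re
from itertools import product

# Hypothetical one-letter class codes (not USE names):
#   B = base or consonant, M = medial, S = consonant_subjoined,
#   H = subjoiner, V = vowel, F = final_consonant
NESTED = re.compile(r"B(?:(?:M|S|HB)*V*(?:F|HB)*)*")
FLAT = re.compile(r"B(?:M|S|HB|V|F)*")

def scan(classes: str) -> bool:
    """Single left-to-right pass with no backtracking."""
    if not classes.startswith("B"):
        return False
    i = 1
    while i < len(classes):
        if classes[i] in "MSVF":
            i += 1                      # one-class element
        elif classes[i] == "H" and classes[i + 1:i + 2] == "B":
            i += 2                      # subjoiner followed by a consonant
        else:
            return False                # bare consonant or dangling subjoiner
    return True

def agree_up_to(max_len: int) -> bool:
    """Brute-force check that all three matchers accept the same strings."""
    for n in range(max_len + 1):
        for combo in product("BMSHVF", repeat=n):
            s = "".join(combo)
            results = {
                NESTED.fullmatch(s) is not None,
                FLAT.fullmatch(s) is not None,
                scan(s),
            }
            if len(results) != 1:
                return False
    return True

print(agree_up_to(4))  # True
```

The brute-force comparison is only a sanity check over short strings,
not a proof, and it says nothing about the modifiers I omitted.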

