Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Tue Oct 8 09:25:34 CDT 2019

I've been puzzling over how a pure regular expression engine that works
via a non-deterministic finite automaton can be bent to accommodate
'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters'
of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an
implementation shall provide a mechanism for matching against an
arbitrary extended grapheme cluster, a literal cluster, and matching
extended grapheme cluster boundaries."  It works from a regular
expression by stitching together the FSMs corresponding to its elements.

An example UTS#18 gives for matching a literal cluster can be simplified
to, in its notation:

[c \q{ch}]

This is interpreted as 'match against "ch" if possible, otherwise
against "c".  Thus the strings "ca" and "cha" would both match the

[c \q{ch}]a

while "chh" but not "ch" would match against

[c \q{ch}]h

Or have I got this wrong?

Thus, while "[c \q{ch}]" may be a regex, it is clearly not any notation
for a regular expression in the mathematical sense.

It seems to me that this expression requires backtracking, which is
totally alien to the design of the regular expression engine.  One
problem then is that the engine supports both the union and
intersection of regular languages.  While algebraic manipulation might
raise union to the highest level, eliminating intersection is an
expensive operation which I have deliberately avoided.  While
backtracking is feasible if state progression has been restricted to
the FSM for a literal cluster, it is far more difficult if multiple
FSMs have been running in parallel.

As the engine fully respects canonical equivalence (with the result
that it can find an accented letter of the Vietnamese alphabet even if
it bears a subscript tone mark), concatenated subexpressions can
divide the input streams between them.  Consequently, the
backtracking mechanism gets complicated.

May I correctly argue instead that matching against literal clusters
would be satisfied by instead supporting, for this example, the regular
subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?


More information about the Unicode mailing list