Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at
Fri May 15 02:10:03 CDT 2015

On Fri, 15 May 2015 02:10:36 +0200
Philippe Verdy <verdy_p at> wrote:

> 2015-05-14 20:13 GMT+02:00 Richard Wordingham <
> richard.wordingham at>:
> > On Thu, 14 May 2015 12:58:29 +0200
> > Philippe Verdy <verdy_p at> wrote:
> >
> > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham <
> > > richard.wordingham at>:
> > >
> > > > An elegant formal solution to the Kleene star problem interprets
> > > > (\u0323\u0302)* as (\u0323|\u0302)*.  However, that is
> > > > counter-intuitive
> >
> > The technical term for this is the 'concurrent iteration' - or at
> > least, that's the term used in the 'Book of Traces'.
> >
> > > For your example "(\u0323\u0302)*" the characters in the
> > > once converted to NFD (which is the same here) are just using at
> > > most two distinct non-zero combining classes and no blocker; so
> > > it is safe to trasform it to (\u0323|\u0302)* for a first pass
> > > matching that will then only check candidate matchs in the second
> > > pass. or more efficiently, a second finite state automata (FSA)
> > > running in parallel with its own state:
> >
> > You've forgotten the basic problem.  A *finite* state automaton
> > cannot count very far; with only n states, it cannot count as far
> > as n.
> >
> I did not forget it, this is why there's a second pass (or a second
> FSA running in parallel to indicate its own accept state). You have
> to combine the two states variables to get the final combined state
> to determine if it is a final accept state.

Your description makes no sense to me as a description of a finite
state automaton.  Now, a program to check whether a trace matching
{\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It just
counts the number of times \u0323 occurs and the number of times
\u0302 occurs, and returns whether they are equal.  The two counters
are the key variables (and one could just keep the difference in the
counts).  However, this is not a finite state automaton.

Now, to some extent I am cheating by assuming that the characters are
delivered in NFD order.  If I did not do this, to construct the
non-deterministic finite automaton (NDFA) for the concatenation of two
sets / regular expressions, the triples (x, y, n) of (left NDFA state,
right NDFA state, highest non-zero ccc assigned to righthand component)
would need to be expanded.  The third component would become a list of
non-zero ccc's - in principle 2^254 values, but in fact rather fewer as
not all 255 ccc values are used by Unicode.  It is still finite.  I
prefer to keep the complexity out of the regular expression engine

Given a NDFA recognising a set of NFD strings, one can convert it to a
deterministic finite automaton (DFA), say X, provided one does not run
out of memory or time. One can then 'easily' construct a DFA Y
recognising the canonical equivalents of the strings.  The state in DFA
Y reached by string x is defined to be the state reached by the string
to_NFD(x) in DFA X.  This method relies on the identity
to_NFD(to_NFD(x)z) = to_NFD(xz). 

This handles the recognition of a string canonically equivalent to
\u0323\u0302. (The constructions above are sledge hammers; the NDFAs
have many unreachable states.) However, recognising canonical
equivalents of (\u0323\u0302)* via an FSM is rather more difficult; to
be precise, it cannot be done by an FSM.


More information about the Unicode mailing list