Regular Expressions and Canonical Equivalence

Philippe Verdy verdy_p at
Fri May 15 19:41:33 CDT 2015

With a NFA, the representation is completely different, The regexp


is just transformed into:


where I noted with the "tack" the 15 relative positions **in this new
regex** where there's a need to check if the input matches a character or
character class.

Note that in this transform, all allowed permutations of canonically
equivalent substrings are added; given that these substrings are bounded in
length in the initial regexp and there's a limited number of permutations
the result is still bounded.

The state of the NFA is represented as a set of these positions (here a
bitset with 15 bits). The initial state has only the first bit set to true,
the final accept state must just have the 15th bit set to true.

When you scan the input, you have to test the inpout character for each
position in the bitset that is currently true, if the associated character
or character class matches the input, and then advance this bit.

For that you use a second separate bitset initially empty (all 15 bits set
to false), and to advance bit n in the state, you will set bit (n+1) to
true; but to advance from bit 15, you don't set any in the second bitset.

You may also want to avoid generating these permutations:


Here I noted with the "combining glottal stop" the positions where you have
to count the characters ONLY in the subsequence, i.e. my "tacks" that are
just after the asterisks.

However in both cases (either the generated permutations, or using
counters) you'll need to use backtracing for rollbacks. Performing a
rollback in an NFA is not easy! you have to remember the bitsets
representing the state of the NFA before you advanced it to the next
state... If you did not want to generate the permutations but only use
counters, the backtracing to keep must also contain these counters (I have
no idea how to safely rollback those counters, my opinion is that it will
not work, and generating the permutations, even if this increases the
number of "tack" positions in the transformed regexp, is MUCH simpler, and
does not really generate a significant cost in term of memory).

But it's true that the allowed reorderings implied by canonical
equivalences (and those that are NOT allowed because they are blocked) are
really challenging !

2015-05-16 2:04 GMT+02:00 Philippe Verdy <verdy_p at>:

> But do you agree that we still need to match pairs of distinct characters
> in your example ?
> If you just count the otal it will be wrong with (\u0302\u0302\0323)* if
> you transform it into (\u0302|\u0302|\0323)* which is fully equivalent to
> (\u0302|\0323)*, because what you want is not matching pairs but triples
> (your counter check would have now to make sure there are two times more
> occurences of \u0302 and occurences of \u0323.
> If not, you need to rollback (by one or two characters, possibly more,
> until you satisfy the condition, but you won't know by just seeing the
> characters and advancing that your sequence is terminated: it is only at
> end that you have to do this check and only then you can rollback :
> The analysis cannot be deterministic, or it requires keeping a track of
> all acceptable positions previously seen that could satisfy the condition;
> as the sequence for (\u0302\u0302\0323)* can be extremely long, keeping
> this track for possible rollbacks coudl be costly. For example consider
> this regexp:
> (\u0302\u0302\0323)*(\u0302\0303)*\u0302
> Can you still transform it and correctly infer the type of counters you
> need for the final check (before rollbacks) if you replace it with:
> (\u0302|\0323)*(\u0302|\0303)*\u0302 which is fully equivalent to
> (\u0302|\0303|\0323)*\u03202.
> You'd need to check that there are exactly
> - (2n+1) occurences of \0302
> - (n) occurences of \0303
> - (n) occurences of \0323
> But it won't be enough because \0302 and \0303 have the same combining
> class and cannot be reordered. So within the first regexp:
> (\u0302\u0302\0323)*(\u0302\0303)*\u0302
> the first iterated subregexp will need to scan first within the part that
> is to match in the second iterated subregexp, where you cannot predict
> where it will stop. It may even scan completely through it (if you have not
> encountered any \0303) and eaten the last \u0302. At this time, you may see
> that the first iterated subregexp cannot contain any \u0303 so the first
> rollback to do will be to rollback just before the 1st occurence of \0303.
> But the counter check may still be wrong and you'll have to rollback
> through one or two occurences of \u0302 in order to find the location where
> the first iterated subregexp is satisfied. At this point the one ot two
> occurences of \u0302 that you've rolled back will be counted as being part
> of the 2nd iterated regexp, or even be the final occurence looked to match
> the end of the regexp.
> I don't see how you can support this regexp with a DFA, you absolutely
> need an NFA (and the counters you want to add do not offer any decisive
> help).
> 2015-05-16 0:54 GMT+02:00 Richard Wordingham <
> richard.wordingham at>:
>> On Sat, 16 May 2015 00:31:53 +0200
>> Philippe Verdy <verdy_p at> wrote:
>> > 2015-05-15 23:57 GMT+02:00 Richard Wordingham <
>> > richard.wordingham at>:
>> >
>> > > On Fri, 15 May 2015 22:09:13 +0200
>> > > Philippe Verdy <verdy_p at> wrote:
>> > >
>> > > > 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
>> > > > richard.wordingham at>:
>> > >
>> > > > This is because you don't understand the issue !
>> > >
>> > > > > Now, a program to check whether a trace matching
>> > > > > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It
>> > > > > just counts the number of times \u0323 occurs and the number of
>> > > > > times \u0302 occurs, and returns whether they are equal.
>> > >
>> > > > This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would
>> > > > pass your counting test (which does not work in a FSA) but they
>> > > > are NOT canonically equivalent because the identical combining
>> > > > characters are blocking each other (so arbitrary ordering is not
>> > > > possible).
>> > >
>> > > TUS7.0: D108   Reorderable pair:
>> > >  Two adjacent characters A and B in a coded character sequence
>> > >  <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
>> > >
>> > > Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0303)
>> > > is a reorderable pair.
>> > >
>> >
>> > I do NOT contest that U+0323 and U+0302 can reorder, but the fact that
>> > U+0323 blocks another occurence of U+0323 because it has the **same**
>> > combining class.
>> How does that stop <U+0323, U+0323, U+0302, U+0302> and <U+0323,
>> U+0302, U+0323, U+0302> being canonically equivalent?
>> TUS7.0: D109   'Canonical Ordering Algorithm' says:
>> "In a decomposed character sequence D, exchange the positions of the
>> characters in each Reorderable Pair until the sequence contains no more
>> Reorderable Pairs."
>> There is no mention of blocking in D109.
>> Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Unicode mailing list