Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at
Fri May 15 16:57:03 CDT 2015

On Fri, 15 May 2015 22:09:13 +0200
Philippe Verdy <verdy_p at> wrote:

> 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
> richard.wordingham at>:

> This is because you don't understand the issue !

> > Now, a program to check whether a trace matching
> > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It just
> > counts the number of times \u0323 occurs and the number of times
> > \u0302 occurs, and returns whether they are equal.
> This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would
> pass your counting test (which does not work in a FSA) but they are
> NOT canonically equivalent because the identical combining characters
> are blocking each other (so arbitrary ordering is not possible).

TUS7.0: D108   Reorderable pair:
 Two adjacent characters A and B in a coded character sequence
 <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0. 

Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0303) is
a reorderable pair.

TUS7.0: D109   Canonical Ordering Algorithm:
 In a decomposed character sequence D, exchange the positions of the
 characters in each Reorderable Pair until the sequence contains no
 more Reorderable Pairs.

The normalisation process on <U+0323, U+0302, U+0323, U+0302> first
replaces it by <U+0323, U+0323, U+0302, U+0302>.  There are then no
more reorderable pairs, so that has reduced it to form NFD.

Therefore <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323,
U+0302> *are* canonically equivalent.

> So the FSA implementation is perfectly possible, for canonical
> equivalences only...

I now vaguely follow your argument, but it depends on the erroneous
claim that <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323,
U+0302> are not canonically equivalent.

> This evidently does not work if you are
> performing regexp searches using looser equivalences, such as
> compatibility equivalence.

I completely fail to understand this remark; it makes no difference
whether one uses canonical or compatibility equivalence.


More information about the Unicode mailing list