Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Fri May 15 17:54:22 CDT 2015


On Sat, 16 May 2015 00:31:53 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-15 23:57 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > On Fri, 15 May 2015 22:09:13 +0200
> > Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> >
> > > 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
> > > richard.wordingham at ntlworld.com>:
> >
> > > This is because you don't understand the issue !
> >
> > > > Now, a program to check whether a trace matching
> > > > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It
> > > > just counts the number of times \u0323 occurs and the number of
> > > > times \u0302 occurs, and returns whether they are equal.
> >
> > > This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would
> > > pass your counting test (which does not work in a FSA) but they
> > > are NOT canonically equivalent because the identical combining
> > > characters are blocking each other (so arbitrary ordering is not
> > > possible).
> >
> > TUS7.0: D108   Reorderable pair:
> >  Two adjacent characters A and B in a coded character sequence
> >  <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
> >
> > Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0303)
> > is a reorderable pair.
> >
> 
> I do NOT contest that U+0323 and U+0302 can reorder, but the fact that
> U+0323 blocks another occurence of U+0323 because it has the **same**
> combining class.

How does that stop <U+0323, U+0323, U+0302, U+0302> and <U+0323,
U+0302, U+0323, U+0302> being canonically equivalent?

TUS7.0: D109   'Canonical Ordering Algorithm' says:
"In a decomposed character sequence D, exchange the positions of the
characters in each Reorderable Pair until the sequence contains no more
Reorderable Pairs."

There is no mention of blocking in D109.

Richard.


More information about the Unicode mailing list