Regular Expressions and Canonical Equivalence

Philippe Verdy verdy_p at wanadoo.fr
Fri May 15 19:04:55 CDT 2015


But do you agree that we still need to match pairs of distinct characters
in your example ?
If you just count the otal it will be wrong with (\u0302\u0302\0323)* if
you transform it into (\u0302|\u0302|\0323)* which is fully equivalent to
(\u0302|\0323)*, because what you want is not matching pairs but triples
(your counter check would have now to make sure there are two times more
occurences of \u0302 and occurences of \u0323.
If not, you need to rollback (by one or two characters, possibly more,
until you satisfy the condition, but you won't know by just seeing the
characters and advancing that your sequence is terminated: it is only at
end that you have to do this check and only then you can rollback :

The analysis cannot be deterministic, or it requires keeping a track of all
acceptable positions previously seen that could satisfy the condition; as
the sequence for (\u0302\u0302\0323)* can be extremely long, keeping this
track for possible rollbacks coudl be costly. For example consider this
regexp:

(\u0302\u0302\0323)*(\u0302\0303)*\u0302

Can you still transform it and correctly infer the type of counters you
need for the final check (before rollbacks) if you replace it with:

(\u0302|\0323)*(\u0302|\0303)*\u0302 which is fully equivalent to
(\u0302|\0303|\0323)*\u03202.

You'd need to check that there are exactly
- (2n+1) occurences of \0302
- (n) occurences of \0303
- (n) occurences of \0323

But it won't be enough because \0302 and \0303 have the same combining
class and cannot be reordered. So within the first regexp:

(\u0302\u0302\0323)*(\u0302\0303)*\u0302

the first iterated subregexp will need to scan first within the part that
is to match in the second iterated subregexp, where you cannot predict
where it will stop. It may even scan completely through it (if you have not
encountered any \0303) and eaten the last \u0302. At this time, you may see
that the first iterated subregexp cannot contain any \u0303 so the first
rollback to do will be to rollback just before the 1st occurence of \0303.

But the counter check may still be wrong and you'll have to rollback
through one or two occurences of \u0302 in order to find the location where
the first iterated subregexp is satisfied. At this point the one ot two
occurences of \u0302 that you've rolled back will be counted as being part
of the 2nd iterated regexp, or even be the final occurence looked to match
the end of the regexp.

I don't see how you can support this regexp with a DFA, you absolutely need
an NFA (and the counters you want to add do not offer any decisive help).


2015-05-16 0:54 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sat, 16 May 2015 00:31:53 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > 2015-05-15 23:57 GMT+02:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
> >
> > > On Fri, 15 May 2015 22:09:13 +0200
> > > Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> > >
> > > > 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
> > > > richard.wordingham at ntlworld.com>:
> > >
> > > > This is because you don't understand the issue !
> > >
> > > > > Now, a program to check whether a trace matching
> > > > > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It
> > > > > just counts the number of times \u0323 occurs and the number of
> > > > > times \u0302 occurs, and returns whether they are equal.
> > >
> > > > This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would
> > > > pass your counting test (which does not work in a FSA) but they
> > > > are NOT canonically equivalent because the identical combining
> > > > characters are blocking each other (so arbitrary ordering is not
> > > > possible).
> > >
> > > TUS7.0: D108   Reorderable pair:
> > >  Two adjacent characters A and B in a coded character sequence
> > >  <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
> > >
> > > Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0303)
> > > is a reorderable pair.
> > >
> >
> > I do NOT contest that U+0323 and U+0302 can reorder, but the fact that
> > U+0323 blocks another occurence of U+0323 because it has the **same**
> > combining class.
>
> How does that stop <U+0323, U+0323, U+0302, U+0302> and <U+0323,
> U+0302, U+0323, U+0302> being canonically equivalent?
>
> TUS7.0: D109   'Canonical Ordering Algorithm' says:
> "In a decomposed character sequence D, exchange the positions of the
> characters in each Reorderable Pair until the sequence contains no more
> Reorderable Pairs."
>
> There is no mention of blocking in D109.
>
> Richard.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20150516/9917c702/attachment.html>


More information about the Unicode mailing list