Regexes, Canonical Equivalence and Backtracking of Input

Richard Wordingham richard.wordingham at ntlworld.com
Mon May 18 15:32:02 CDT 2015


On Mon, 18 May 2015 21:05:49 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-18 20:35 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > The algorithm itself should be tractable - Mark Davis has published
> > an algorithm to generate all strings canonically equivalent to a
> > Unicode string, and what we need might not be so complex.
> 
> 
> Even this algorithm from Mark Davis will fail in this case:

How so?  The regexp is \u0F73*, which is converted to a non-capturing
(\u0F71\u0F72)*.

Given a string 0F40 0F71 0F73 0F42 representing the trace, matching
will fail at 0F40 and an attempt will be made starting at the 0F71.
The input string handling part will then present a run of three
non-starters:

\u0F71 \u0F71 \u0F72

I think the process is even simpler than I first thought.

The engine will look for a match for \u0F71, and take it from this
list, leaving \u0F71 \u0F72.

It will then look for a match for \u0F72, and take it form the list,
leaving \u0F71.

It will then look for a match for \u0F71, and take it from the list.

It will then look for a match for \u0F72.  It will fail, and then back
track, disgorging the \0F71.

The input 'stream' now looks like \u0F71 \u0F42.  This will match
nothing; it is after the matching substream. 

The matching substring is:

None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42.

Its value, as a trace, is 0F71 0F72.

> - You can use it easily to transform a regexp containing (\u0F73)
> into a regexp containing  (\u0F73|\u0F71\u0F72|\u0F71\u0F72)

That is *not* what I am suggesting.  The regex needs decomposing, but
no other transformations.  It is the string representing the input
trace that is expanded.

> - But this leaves the same problem for unbounded repetititions with
> the "+" or "*" or "{m,}" operators.

Not at all - that is the beauty of the scheme.  On the regex
side, \u0F73* is as straight forward as non-capturing (\u0061\u0062)*.
Putting back the unused fragments of the run of non-starters in the
input trace is the most difficult part.

> Now all the problem is how to do the backtracking,

Yes, that may be more difficult than I thought.  Comparing against
literal characters is simple, but it may be more complicated when
matching against a list of alternative characters.  Back-tracking
schemes may not be set up to try the next character on a list of
alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input
string 0F71 0F72.  The alternative (\u0f72|\u0f71) would first take the
0F72, and only on backtracking would it take the 0F71 instead.  This is
an issue with traces that does not appear with strings.

Richard.


More information about the Unicode mailing list