Regexes, Canonical Equivalence and Backtracking of Input
richard.wordingham at ntlworld.com
Mon May 18 13:35:45 CDT 2015
Philippe and I have got bogged down in a long discussion of how to
parse traces of Unicode strings under canonical equivalence against
non-regular Kleene star of regular expressions. Fortunately, such
expressions can be expected to have very little use. A seemingly simple
example is the regex \u0f73* i.e. any number of occurrences of U+0F73
TIBETAN VOWEL SIGN II, and not \u0f71\u0f72*. An example of a string
matching under canonical equivalence is 0F71 0F71 0F72 0F72.
I believe we both thought that characters would arrive from the trace
in a deterministic order. Now, many regular expression engines
back-track their parsing of the input string (no-one has reported
working with input traces). A possibly useful trick would be for
characters to be taken from the input file in accordance with the
matching to the pattern, with input also back-tracked if matching
fails. The notion of next character would depend on the state of the
In the example above, the engine would just take the input in the
order 0F71 0F72 0F71 0F72. Match found, job done.
One advantage of this scheme is that there would be no need for
adjustments to deal with the interleaving of adjacent matches to
successive subexpressions. There would be no nagging worry that
one's rational expression was not a regular expression when applied
Any theoreticians around may be wondering how this magic is achieved.
The simple answer is that the non-finiteness has been transferred to:
(1) the back-tracking through parse options; and
(2) the algorithm to walk through the character sequencing options.
The algorithm itself should be tractable - Mark Davis has published
an algorithm to generate all strings canonically equivalent to a
Unicode string, and what we need might not be so complex.
I offer this thought up as it seems that, for a regex engine working on
traces with deterministic input, the byte code for a regex
concatenation AB or iteration A* is much more complicated than the code
for the subregexes A and B. I have a worry that the length of the
compiled code might even be exponential with the length of the regex.
(I may be wrong - there might be a limit to what one can do for worst
case complexity of the interleaving.) Choosing the input to match the
regex would remove this problem.
More information about the Unicode