Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Sun Oct 13 08:00:18 CDT 2019

On Sun, 13 Oct 2019 10:04:34 +0200
Hans Åberg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > 
> > On Sat, 12 Oct 2019 21:36:45 +0200
> > Hans Åberg via Unicode <unicode at unicode.org> wrote:
> >   
> >>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
> >>> <unicode at unicode.org> wrote:
> >>> 
> >>> But remember that 'having longer first' is meaningless for a
> >>> non-deterministic finite automaton that does a single pass through
> >>> the string to be searched.    
> >> 
> >> It is possible to identify all submatches deterministically in
> >> linear time without backtracking — I a made an algorithm for
> >> that.  
> > 
> > That's impressive, as the number of possible submatches for
> > a*(a*)a* is quadratic in the string length.  
> That is probably after the possibilities in the matching graph have
> been expanded, which can even be exponential. As an analogy, think of
> a polynomial product, I compute the product, not the expansion.

I'm now beginning to wonder what you are claiming. One thing one can
do without backtracking is to determine which capture groups capture
something, and which combinations of capturing or not occur.  That's
a straightforward extension of doing the overall 'recognition' in linear
time - at least, linear in length (n) of the searched string.  (I say
straightforward, but it would mess up my state naming algorithm.)  The
time can also depend on the complexity of the regular expression, which
can be bounded by the length (m) of the expression if working with mere
strings, giving time O(mn) if one doesn't undertake the worst case
O(2^m) task of converting the non-deterministic FSM to a deterministic

Using m as a complexity measure for traces may be misleading, and I
think plain wrong; for moderate m, the complexity can easily go up as
fast as m^10, and I think higher powers are possible.  Strings
exercising the higher complexities are linguistically implausible.



More information about the Unicode mailing list