Pure Regular Expression Engines and Literal Clusters

Sun Oct 13 14:17:54 CDT 2019

On Sun, 13 Oct 2019 15:29:04 +0200
Hans Åberg via Unicode <unicode at unicode.org> wrote:

> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
> > I'm now beginning to wonder what you are claiming.  

> I start with a NFA with no empty transitions and apply the subset DFA
> construction dynamically for a given string along with some reverse
> NFA-data that is enough to transverse backwards when a final state
> arrives. The result is a NFA where all transverses is a match of the
> string at that position.

And then the speed comparison depends on how quickly one can extract
the match information required from that data structure.

Incidentally, at least some of the sizes and timings I gave seem to be
wrong even for strings.  They won't work with numeric quantifiers, as
in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

One gets lesser issues in quantifying complexity if one wants "Å" to
match \p{Lu} when working in NFD - potentially a different state for
each prefix of the capital letters.  (It's also the case except for
UTF-32 if characters are treated as sequences of code units.)  Perhaps
'upper case letter that Unicode happens to have encoded as a single
character' isn't a concept that regular expressions need to support
concisely. What's needed is to have a set somewhere between
[\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to
include "ff" - there are English surnames like "ffrench".

Regards,

Richard.