Pure Regular Expression Engines and Literal Clusters

Sun Oct 13 15:14:10 CDT 2019

> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans Åberg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
>>> I'm now beginning to wonder what you are claiming.  
> 
>> I start with a NFA with no empty transitions and apply the subset DFA
>> construction dynamically for a given string along with some reverse
>> NFA-data that is enough to transverse backwards when a final state
>> arrives. The result is a NFA where all transverses is a match of the
>> string at that position.
> 
> And then the speed comparison depends on how quickly one can extract
> the match information required from that data structure.

Yes. For example, one should match the saved DFA in constant time, if matched as dynamic sets which is linear in set size, then one can get quadratic time complexity in string size.

Even though one can iterate through each match NFA in linear time, it could have say two choices at each character position each leading to the next, which would give an exponential size relative the string length.

Normally one is not interested in all matches, this is the disambiguation rules that do that.

> Incidentally, at least some of the sizes and timings I gave seem to be
> wrong even for strings.  They won't work with numeric quantifiers, as
> in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

For those, one normally implements a loop iteration. I did not that. I mentioned this method to Tim Shen on the libstdc++ list, so perhaps he might have implemented something.

> One gets lesser issues in quantifying complexity if one wants "Å" to
> match \p{Lu} when working in NFD - potentially a different state for
> each prefix of the capital letters.  (It's also the case except for
> UTF-32 if characters are treated as sequences of code units.)  Perhaps
> 'upper case letter that Unicode happens to have encoded as a single
> character' isn't a concept that regular expressions need to support
> concisely. What's needed is to have a set somewhere between
> [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to
> include "ff" - there are English surnames like "ffrench”.

I made some C++ templates that translate Unicode code point character classes into UTF-8/32 regular expressions. So anything that can be reduced to actual regular expressions would work.