Pure Regular Expression Engines and Literal Clusters
Richard Wordingham via Unicode
unicode at unicode.org
Sun Oct 13 16:54:12 CDT 2019
On Sun, 13 Oct 2019 22:14:10 +0200
Hans Åberg via Unicode <unicode at unicode.org> wrote:
> > On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
> > Incidentally, at least some of the sizes and timings I gave seem to
> > be wrong even for strings. They won't work with numeric
> > quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.
> > One gets lesser issues in quantifying complexity if one wants "Å" to
> > match \p{Lu} when working in NFD - potentially a different state for
> > each prefix of the capital letters. (It's also the case except for
> > UTF-32 if characters are treated as sequences of code units.)
> > Perhaps 'upper case letter that Unicode happens to have encoded as
> > a single character' isn't a concept that regular expressions need
> > to support concisely. What's needed is to have a set somewhere
> > between [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be
> > extended to include "ff" - there are English surnames like
> > "ffrench”.
The point about these examples is that the estimate of one state per
character becomes a severe underestimate. For example, after
processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
be in any of about 50 states. The number of possible states is not
linear in the length of the expression. While a 'loop iteration' can
keep the size of the compiled regex down, it doesn't prevent the
proliferation of states - just add zeroes to my example.
> I made some C++ templates that translate Unicode code point character
> classes into UTF-8/32 regular expressions. So anything that can be
> reduced to actual regular expressions would work.
Besides invalidating complexity metrics, the issue was what \p{Lu}
should match. For example, with PCRE syntax, GNU grep Version 2.25
\p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
canonical equivalence, I want both to match [:Lu:], and that's what I
do. [:Lu:] can then match a sequence of up to 4 NFD characters.
Regards,
Richard.
More information about the Unicode
mailing list