Pure Regular Expression Engines and Literal Clusters
Hans Åberg via Unicode
unicode at unicode.org
Sun Oct 13 17:22:36 CDT 2019
> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate. For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of about 50 states. The number of possible states is not
> linear in the length of the expression. While a 'loop iteration' can
> keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example.
Formally, only the expansion of such ranges is an NFA, and I haven’t seen anyone consider the complexity with the ranges included. So to me, it seems like just a hack.
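To make that concrete: each counted factor X{m,n} has to be unfolded into m mandatory copies of X followed by n-m nested optional copies before one actually has an NFA, so the automaton grows with the counter values, not with the written length of the pattern. A minimal sketch of that unfolding (expand_bounded is my illustration, not code from any engine):

    # Expand atom{m,n} into an explicit regex with no counters,
    # the form an NFA construction actually has to work with.
    def expand_bounded(atom: str, m: int, n: int) -> str:
        required = atom * m            # m mandatory copies
        optional = ""
        for _ in range(n - m):         # n - m nested optional copies
            optional = "(" + atom + optional + ")?"
        return required + optional

    # The three counted factors from the example quoted above:
    print(expand_bounded("[ab]", 0, 20))   # 20 nested optional groups
    print(expand_bounded("[ac]", 10, 20))  # 10 copies plus 10 optional groups
    print(expand_bounded("[ad]", 0, 20))   # 20 more optional groups

Appending a zero to each counter makes each expansion ten times longer, which is exactly the proliferation being described.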
>> I made some C++ templates that translate Unicode code point character
>> classes into UTF-8/32 regular expressions. So anything that can be
>> reduced to actual regular expressions would work.
>
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match. For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
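One way to read that requirement: a string matches [:Lu:] under canonical equivalence when its NFD form is an Lu base character followed only by combining marks. A rough sketch of that predicate (in Python, not my C++ templates; note it deliberately also accepts mark sequences that have no precomposed equivalent):

    import unicodedata

    def matches_Lu(s: str) -> bool:
        """True if s is canonically an uppercase letter plus marks."""
        nfd = unicodedata.normalize("NFD", s)
        return (nfd != ""
                and unicodedata.category(nfd[0]) == "Lu"
                and all(unicodedata.category(c).startswith("M")
                        for c in nfd[1:]))

    print(matches_Lu("\u0100"))    # U+0100, precomposed: True
    print(matches_Lu("A\u0300"))   # <A, U+0300>, decomposed: True
    print(matches_Lu("a\u0300"))   # lowercase base: False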
Hopefully some experts here can chime in and explain exactly which class of regular expressions they have in mind.
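For reference, the simplest case of the code point class translation I mentioned: when the UTF-8 encodings of the range endpoints share all but their final byte, the class compiles to a fixed byte prefix plus a single byte range. A sketch (Python here for brevity; general ranges first need splitting at encoding-length and lead-byte boundaries):

    def utf8_class(lo: int, hi: int) -> str:
        """Compile [lo-hi] to a byte-level regex; shared-prefix case only."""
        lo_b = chr(lo).encode("utf-8")
        hi_b = chr(hi).encode("utf-8")
        assert lo_b[:-1] == hi_b[:-1], "general ranges need splitting"
        prefix = "".join("\\x%02X" % b for b in lo_b[:-1])
        return prefix + "[\\x%02X-\\x%02X]" % (lo_b[-1], hi_b[-1])

    print(utf8_class(0x00C0, 0x00C5))   # \xC3[\x80-\x85]
    print(utf8_class(0x0041, 0x005A))   # [\x41-\x5A], plain ASCII A-Z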