Pure Regular Expression Engines and Literal Clusters
Richard Wordingham via Unicode
unicode at unicode.org
Sun Oct 13 19:10:45 CDT 2019
On Mon, 14 Oct 2019 00:22:36 +0200
Hans Åberg via Unicode <unicode at unicode.org> wrote:
> > On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
> > <unicode at unicode.org> wrote:
>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>> should match. For example, with PCRE syntax (grep -P), GNU grep
>> version 2.25's \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
>> canonical equivalence, I want both to match [:Lu:], and that's what
>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.
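To make that concrete: a minimal reproduction, using Python's
third-party 'regex' module purely as a stand-in for grep's PCRE
(grep -P shows the same behaviour):

    import regex  # third-party; pip install regex

    precomposed = '\u0100'   # LATIN CAPITAL LETTER A WITH MACRON, category Lu
    decomposed = 'A\u0300'   # <A, COMBINING GRAVE ACCENT>

    # \p{Lu} tests a single code point, so it accepts the precomposed
    # letter but cannot consume the canonically equivalent pair.
    print(regex.fullmatch(r'\p{Lu}', precomposed))  # matches
    print(regex.fullmatch(r'\p{Lu}', decomposed))   # None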
> Hopefully some experts here can tune in, explaining exactly what
> regular expressions they have in mind.
The best indication lies at
https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
(2008), the last version before the requirement to support canonical
equivalence was dropped.
It's not entirely coherent, as the authors don't seem to find an
expression like

    \p{L}\p{gcb=extend}*

a natural thing to use, the second factor being mostly sequences of
non-starters. At that point, I would say they weren't expecting
\p{Lu} to fail to match <A, U+0300>, as they were still expecting [ä]
to match both "ä" and "a\u0308".
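The contrast is easy to demonstrate. A sketch, again with the Python
'regex' module, using \p{M}* as a rough stand-in for \p{gcb=extend}*:

    import regex

    cluster = regex.compile(r'\p{L}\p{M}*')   # letter plus combining marks

    for s in ('\u00E4', 'a\u0308'):           # "ä", precomposed and decomposed
        print(repr(s), bool(cluster.fullmatch(s)))    # True for both

    # A literal class works on code points, so the 2008 expectation
    # that [ä] matches both forms fails here:
    print(bool(regex.fullmatch('[\u00E4]', 'a\u0308')))   # False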
They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
were expecting normalisation (even to NFC) to be a possible cure. They
had begun to realise that converting expressions to match all or none
of a set of canonical equivalents was hard; the issue of non-contiguous
matches wasn't mentioned.
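That cure amounts to pre-normalising both sides, which makes
canonically equivalent literals compare equal but does nothing for
concatenation or non-contiguous matches. A sketch, with a hypothetical
helper fullmatch_nfd:

    import unicodedata
    import regex

    def fullmatch_nfd(pattern, text):
        # Hypothetical helper: spell pattern and text in NFD so that
        # canonically equivalent literals look identical.  Normalising
        # a pattern as plain text is only safe for literal patterns;
        # operators and classes are where the approach breaks down.
        return regex.fullmatch(unicodedata.normalize('NFD', pattern),
                               unicodedata.normalize('NFD', text))

    print(bool(fullmatch_nfd('\u00E4', 'a\u0308')))   # True
    print(bool(fullmatch_nfd('a\u0308', '\u00E4')))   # True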
When I say 'hard', I'm thinking of the problem that concatenation may
require dissolution of the two constituent expressions and involve the
temporary creation of 54-fold (if text is handled as NFD) or 2^54-fold
(no normalisation) sets of extra states, 54 being the number of
non-zero canonical combining classes in use. That's what's driven me
to write my own regular expression engine for traces.
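To give a feel for where the blow-up comes from: under canonical
equivalence, marks with distinct non-zero combining classes commute,
which is exactly a trace monoid. A brute-force sketch (nothing like
what an engine would actually do, just to show the combinatorics),
enumerating one equivalence class of a starter-free mark sequence:

    import unicodedata
    from itertools import permutations

    def canonical_equivalents(marks):
        # Reorderings canonically equivalent to a starter-free mark
        # sequence: marks with equal combining class must keep their
        # relative order; marks with distinct classes may permute.
        ccc = [unicodedata.combining(c) for c in marks]
        result = set()
        for perm in permutations(range(len(marks))):
            if all(perm[i] < perm[j]
                   for i in range(len(perm))
                   for j in range(i + 1, len(perm))
                   if ccc[perm[i]] == ccc[perm[j]]):
                result.add(''.join(marks[k] for k in perm))
        return result

    # Three marks with distinct classes (220, 230, 240): all six
    # orderings are equivalent, and per-class bookkeeping like this
    # is what multiplies an engine's states.
    print(len(canonical_equivalents('\u0316\u0301\u0345')))   # 6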
Regards,
Richard.