Pure Regular Expression Engines and Literal Clusters

Sun Oct 13 20:38:58 CDT 2019

On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
> 
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
> 
> Am I missing anything?

Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>.  Now, I could invent a string property so
that \p{xLu} that meant (:?\p{Lu}\p{Mn}*).

I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match".  I think only Greek letters expand to 4 characters in NFD.

When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>.  The canonical closure of that
sequence can be messy even within scripts.  Some pairs commute: others
don't, usually for good reasons.

Regards,

Richard.