Pure Regular Expression Engines and Literal Clusters
Richard Wordingham via Unicode
unicode at unicode.org
Sun Oct 13 20:38:58 CDT 2019
On Sun, 13 Oct 2019 17:13:28 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> > Besides invalidating complexity metrics, the issue was what \p{Lu}
> > should match. For example, with PCRE syntax, GNU grep Version 2.25
> > \p{Lu} matches U+0100 but not <A, U+0300>. When I'm respecting
> > canonical equivalence, I want both to match [:Lu:], and that's what I
> > do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
Yes. There is no precomposed LATIN CAPITAL LETTER M WITH CIRCUMFLEX, so
[:Lu:] should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
CIRCUMFLEX ACCENT>. Now, I could invent a string property such
that \p{xLu} meant (?:\p{Lu}\p{Mn}*).
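
The distinction above can be sketched in Python. This is only an illustration, not the matcher described in this thread: the two function names are hypothetical, and using NFC composition as the test for "canonically equivalent to a single Lu character" is an assumption that could go wrong for characters excluded from composition.

```python
import unicodedata


def is_lu_codepointwise(s: str) -> bool:
    """Per-code-point view: s is exactly one code point whose General_Category is Lu."""
    return len(s) == 1 and unicodedata.category(s) == "Lu"


def is_lu_up_to_canonical_equivalence(s: str) -> bool:
    """Approximate canonical-equivalence view: s composes (NFC) to one Lu code point.

    Assumption: NFC composition is used as a stand-in for "canonically
    equivalent to some single Lu character".
    """
    composed = unicodedata.normalize("NFC", s)
    return len(composed) == 1 and unicodedata.category(composed) == "Lu"


# U+0100 passes both tests.
assert is_lu_codepointwise("\u0100")
assert is_lu_up_to_canonical_equivalence("\u0100")

# <A, U+0300> fails the per-code-point test but is equivalent to U+00C0.
assert not is_lu_codepointwise("A\u0300")
assert is_lu_up_to_canonical_equivalence("A\u0300")

# <M, U+0302> has no precomposed equivalent, so it fails both tests,
# although it would match the looser pattern \p{Lu}\p{Mn}*.
assert not is_lu_up_to_canonical_equivalence("M\u0302")
```
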
I don't entirely understand what you said; you may have missed the
distinction between "[:Lu:] can then match" and "[:Lu:] will then
match". I think only Greek letters expand to 4 characters in NFD.
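
The guess about 4-character expansions can be checked against whatever version of the Unicode Character Database ships with a given Python. The following is a rough sketch; the function names are hypothetical, and the result depends on the Unicode version of the local `unicodedata` module.

```python
import unicodedata


def code_points_with_nfd_length(target_len: int = 4) -> list[int]:
    """Return every code point whose NFD form has exactly target_len code points."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates; they are not scalar values
        if len(unicodedata.normalize("NFD", chr(cp))) == target_len:
            hits.append(cp)
    return hits


def describe(cp: int) -> str:
    """Format one code point as 'U+XXXX <General_Category> <name>'."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch, '<unnamed>')}"
```

Printing `describe(cp)` for each result of `code_points_with_nfd_length(4)` lists the names and general categories, which shows whether every entry is Greek and which of them, if any, are Lu. One entry is U+1F82 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI, which decomposes to <U+03B1, U+0313, U+0300, U+0345>.
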
When I'm respecting canonical equivalence/working with traces, I want
[:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
equivalent <U+0E49, U+0E39>. The canonical closure of that
sequence can be messy even within scripts. Some pairs commute; others
don't, usually for good reasons.
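
As a small illustration of which pairs commute, the sketch below uses canonical combining classes. The helper name `commutes` is hypothetical; the rule it encodes is the standard one from canonical ordering, namely that two adjacent characters may be exchanged when both have nonzero and different combining classes.

```python
import unicodedata


def commutes(a: str, b: str) -> bool:
    """True if <a, b> and <b, a> are canonically equivalent by reordering.

    That holds when both characters have nonzero canonical combining
    classes and those classes differ.
    """
    ca, cb = unicodedata.combining(a), unicodedata.combining(b)
    return ca != 0 and cb != 0 and ca != cb


SARA_UU = "\u0E39"  # ccc = 103
MAI_THO = "\u0E49"  # ccc = 107
SARA_I = "\u0E34"   # ccc = 0
MAI_EK = "\u0E48"   # ccc = 107

# SARA UU and MAI THO commute, so both orders share one NFD form.
assert commutes(SARA_UU, MAI_THO)
assert (unicodedata.normalize("NFD", SARA_UU + MAI_THO)
        == unicodedata.normalize("NFD", MAI_THO + SARA_UU))

# SARA I has combining class 0, so it does not commute with a tone mark.
assert not commutes(SARA_I, MAI_THO)

# Two tone marks share class 107, so their relative order is significant.
assert not commutes(MAI_EK, MAI_THO)
```
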
Regards,
Richard.