Pure Regular Expression Engines and Literal Clusters

Eli Zaretskii via Unicode unicode at unicode.org
Mon Oct 14 02:05:49 CDT 2019


> Date: Mon, 14 Oct 2019 01:10:45 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> >> Besides invalidating complexity metrics, the issue was what \p{Lu}
> >> should match.  For example, with PCRE syntax, GNU grep Version
> >> 2.25's \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
> >> canonical equivalence, I want both to match [:Lu:], and that's what
> >> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
>  
> > Hopefully some experts here can tune in, explaining exactly what
> > regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents
> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.
> 
> It's not entirely coherent, as the authors don't seem to find an
> expression like
> 
> \p{L}\p{gcb=extend}*
> 
> a natural thing to use, as the second factor is mostly sequences of
> non-starters.  At that point, I would say they were still expecting
> \p{Lu} to match <A, U+0300>, just as they were still expecting [ä] to
> match both "ä" and "a\u0308".
> 
> They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*, and
> were expecting normalisation (even to NFC) to be a possible cure.  They
> had begun to realise that converting expressions to match all or none
> of a set of canonical equivalents was hard; the issue of non-contiguous
> matches wasn't mentioned.
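
For concreteness, a minimal sketch of the behaviour described above, in
Python with the third-party "regex" module (assumed here because the
standard "re" module lacks \p{...} property classes; \p{M}* is used
below as a rough stand-in for the \p{gcb=extend}* factor):

    import unicodedata
    import regex

    # \p{Lu} matches the precomposed letter but not its NFD equivalent.
    print(regex.fullmatch(r"\p{Lu}", "\u0100"))       # U+0100: matches
    print(regex.fullmatch(r"\p{Lu}", "A\u0300"))      # <A, U+0300>: None

    # A pattern in the spirit of \p{L}\p{gcb=extend}* does match the
    # decomposed sequence, folding the combining mark into the match.
    print(regex.fullmatch(r"\p{Lu}\p{M}*", "A\u0300"))  # matches

    # Normalizing the text first is the "cure" the 2008 TR18 text
    # contemplated: NFC turns <A, U+0300> into U+00C0, which is Lu.
    print(regex.fullmatch(r"\p{Lu}",
                          unicodedata.normalize("NFC", "A\u0300")))  # matches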

I think these are two separate issues: whether search should normalize
(a.k.a. perform character folding) should be a user option.  You are
talking only about canonical equivalence, but there's also
compatibility decomposition; for example, searching for "1" should
perhaps match ¹ and ①.
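
A minimal sketch of that compatibility case, using only the Python
standard library: under NFKC, SUPERSCRIPT ONE (U+00B9) and CIRCLED
DIGIT ONE (U+2460) both fold to the plain digit, so a search that
applies compatibility folding would treat all three as matches for "1":

    import unicodedata

    needle = "1"
    for haystack in ["1", "\u00B9", "\u2460"]:   # "1", "¹", "①"
        folded = unicodedata.normalize("NFKC", haystack)
        print(repr(haystack), "->", repr(folded), "match:", folded == needle)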

