Pure Regular Expression Engines and Literal Clusters

Hans Åberg via Unicode unicode at unicode.org
Mon Oct 14 08:08:01 CDT 2019



> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode <unicode at unicode.org> wrote:
> 
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
>>> <unicode at unicode.org> wrote:
> 
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not <A, U+0300>.  When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
> 
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The certificate expired one day ago; the browser says this risks the theft of personal and financial information, and refuses to load the page. So one has to load the totally insecure HTTP page, at the risk of creating mayhem on the computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As it says there, one might add all the canonical equivalents to the regex, if one can find them. Alternatively, one could normalize both the regex and the string, keeping track of the translation boundaries on the string so that a match on the normalized string can be translated back to a match on the original string if called for.




