Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Sat Oct 12 17:03:17 CDT 2019


On Fri, 11 Oct 2019 12:39:56 +0200
Elizabeth Mattijsen via Unicode <unicode at unicode.org> wrote:


> Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
>     https://docs.perl6.org/type/Cool#index-entry-Grapheme

This approach does address the issue Mark Davis mentioned about regex
engines working at the wrong level.  Perhaps you can put my mind at
rest about whether it works at all with scripts that subordinate
vowels.

If I wanted to find the occurrences of the Pali word _pacati_ 'to cook'
in Latin-script text using form NFG, I could use a Perl regular
expression like /\b(?:a|pa)?p[aā]c(?:\B.)*/.  (At least,

grep -P '\b(?:a|pa)?p[aā]c\p{Ll}*' file.txt

works on text in NFC.  I couldn't work out the command-line expression
to display a list of matches from Perl, and the PCRE \B is broken beyond
ASCII in GNU grep 2.25.)
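As a rough illustration of the Latin-script search, here is a minimal
sketch in Python rather than Perl or grep, so it is a swapped-in tool
and not the command line asked about above.  The non-capturing groups
are written (?:...), the input is normalised to NFC first, and the name
find_matches and the sample sentence are hypothetical.

```python
import re
import unicodedata

# Optional a- or pa- prefix, then the stem pac-/pāc-, then the rest of the word.
# (?:\B.)* keeps consuming characters until the next word boundary.
PATTERN = re.compile(r"\b(?:a|pa)?p[aā]c(?:\B.)*")

def find_matches(text):
    """Return every match of PATTERN in the NFC-normalised text."""
    text = unicodedata.normalize("NFC", text)
    return [m.group(0) for m in PATTERN.finditer(text)]

if __name__ == "__main__":
    # Hypothetical sample text, for illustration only.
    for word in find_matches("So odanaṃ pacati, te pacanti; apaci ca pācesi."):
        print(word)
```

Python's re module treats \b and \B as Unicode-aware for str patterns,
which is why \B can be used here beyond ASCII.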

How would I do such a search in an Indic script using form NFG?

The main issue is that the single character 'c' would have to expand to
a list of all but one of the Pali grapheme clusters whose initial
consonant transliterates to 'c'.  Have you a notation for such a class?

Regards,

Richard.


