Pure Regular Expression Engines and Literal Clusters

Fri Oct 11 05:39:56 CDT 2019

> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode <unicode at unicode.org> wrote:
> 
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
> 
> [c \q{ch}]
> 
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c"'.  Thus the strings "ca" and "cha" would both match the
> expression
> 
> [c \q{ch}]a
> 
> while "chh" but not "ch" would match against
> 
> [c \q{ch}]h
> 
> Right. We just independently discussed this today in the UTC meeting, connected with the "properties of strings" discussion in the proposed update.
> 
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the alternation -- so this works equivalently if longer strings are sorted first.
> 
> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
> 
> ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].
> 
> ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more backward-compatible.

I have not been following this discussion closely, but the mention of Perl caught my attention.

In Perl 6 (which is a different language from Perl 5 altogether), regular expressions have been completely revamped.

In Perl 6, the use of "|" indicates alternatives using longest token matching (LTM):
    https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_|

In Perl 6, the use of "||" indicates that the first matching alternative wins:
    https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_||
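
To make the difference concrete for the [c \q{ch}] example quoted above, here is a minimal sketch in Python (not Perl 6). It only handles literal alternatives and commits to its choice without backtracking. The function names are made up for illustration, and real LTM in Perl 6 has more rules than "longest literal wins".

```python
def first_match(alternatives, text, pos=0):
    # Ordered choice, in the spirit of "||": the first listed
    # alternative that matches at pos wins.
    for alt in alternatives:
        if text.startswith(alt, pos):
            return alt
    return None


def longest_match(alternatives, text, pos=0):
    # Longest choice, in the spirit of "|": listing order is
    # irrelevant, the longest alternative matching at pos wins.
    matching = [alt for alt in alternatives if text.startswith(alt, pos)]
    return max(matching, key=len, default=None)


def cluster_then(alternatives, suffix, text, choose):
    # Match one alternative picked by `choose`, then `suffix`, then
    # end of text. The pick is committed: no backtracking into it.
    head = choose(alternatives, text)
    return head is not None and text[len(head):] == suffix


print(first_match(["c", "ch"], "cha"))    # c
print(longest_match(["c", "ch"], "cha"))  # ch
print(cluster_then(["c", "ch"], "h", "chh", longest_match))  # True
print(cluster_then(["c", "ch"], "h", "ch", longest_match))   # False
```

With longest_match the order of "c" and "ch" does not matter, whereas first_match only gives the same result when the longer string is listed first.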

Furthermore, Perl 6 uses Normalization Form Grapheme (NFG) for matching:
    https://docs.perl6.org/type/Cool#index-entry-Grapheme
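
As a rough illustration of why grapheme-level matching differs from code-point-level matching, here is a small Python sketch. The clusters() helper is hypothetical and only approximates grapheme clusters by attaching combining marks to the preceding character. It is not the full segmentation algorithm, and it is not how Perl 6 implements NFG.

```python
import unicodedata


def clusters(text):
    # Rough approximation: attach each combining mark (non-zero
    # canonical combining class) to the preceding character.
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


s = "q\u0301a"  # "q" + COMBINING ACUTE ACCENT + "a"
print(len(s))                                # 3 code points
print(len(unicodedata.normalize("NFC", s)))  # 3: no precomposed q-acute exists
print(len(clusters(s)))                      # 2 user-perceived characters
```

A matcher that works on code points sees three units here, while one that works on graphemes sees two, so a pattern that matches "any single character" behaves differently on the two.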

I hope this is relevant to the discussion and offers a new viewpoint.

Elizabeth Mattijsen