Pure Regular Expression Engines and Literal Clusters
Elizabeth Mattijsen via Unicode
unicode at unicode.org
Fri Oct 11 05:39:56 CDT 2019
> On 11 Oct 2019, at 00:23, Markus Scherer via Unicode <unicode at unicode.org> wrote:
>
> On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c"'. Thus the strings "ca" and "cha" would both match the
> expression
>
> [c \q{ch}]a
>
> while "chh" but not "ch" would match against
>
> [c \q{ch}]h
>
> Right. We just independently discussed this today in the UTC meeting, connected with the "properties of strings" discussion in the proposed update.
>
> [c \q{ch}]h should work like (ch|c)h. Note that the order matters in the alternation -- so this works equivalently if longer strings are sorted first.
>
> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
>
> ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].
>
> ICU's UnicodeSet syntax is simpler, the UTS #18 syntax is more backward-compatible.
I have not been following this discussion closely, but the mention of Perl in it prompted me to reply.
In Perl 6 (which is a different language from Perl 5 altogether), regular expressions have been completely revamped.
In Perl 6, the use of "|" indicates alternatives using longest token matching (LTM):
https://docs.perl6.org/language/regexes#index-entry-regex_|-Longest_alternation:_|
In Perl 6, the use of "||" indicates that the first matching alternative wins:
https://docs.perl6.org/language/regexes#index-entry-regex_||-Alternation:_||
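To make the difference concrete, here is a minimal sketch in Python (not Perl 6) that imitates the two behaviours for plain literal alternatives only. The helper names first_match and longest_match are hypothetical, and sorting by length is only an approximation of longest token matching, which in Perl 6 covers more than literal strings.

```python
import re

def first_match(alternatives, text):
    """Return the first listed alternative that is a prefix of text.

    Python's re tries alternatives left to right, which is roughly
    what "||" does in Perl 6.
    """
    pattern = "|".join(re.escape(a) for a in alternatives)
    m = re.match(pattern, text)
    return m.group(0) if m else None

def longest_match(alternatives, text):
    """Return the longest alternative that is a prefix of text.

    Sorting the literals longest-first before building the alternation
    approximates "|" (LTM) for literal alternatives only.
    """
    ordered = sorted(alternatives, key=len, reverse=True)
    return first_match(ordered, text)

print(first_match(["c", "ch"], "cha"))    # c
print(first_match(["ch", "c"], "cha"))    # ch
print(longest_match(["c", "ch"], "cha"))  # ch
```

With the alternatives listed as "c" then "ch", the first-match rule picks "c", while the longest-match rule picks "ch" regardless of the order in which they are written. One caveat: Python's re is a backtracking engine, so a pattern such as (ch|c)h will still match the string "ch" by falling back to the "c" alternative.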
Furthermore, Perl 6 uses Normalization Form Grapheme for matching:
https://docs.perl6.org/type/Cool#index-entry-Grapheme
I hope this is relevant to the discussion and offers a new viewpoint.
Elizabeth Mattijsen