Pure Regular Expression Engines and Literal Clusters

Markus Scherer via Unicode unicode at unicode.org
Thu Oct 10 17:23:00 CDT 2019


On Tue, Oct 8, 2019 at 7:28 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> An example UTS#18 gives for matching a literal cluster can be simplified
> to, in its notation:
>
> [c \q{ch}]
>
> This is interpreted as 'match against "ch" if possible, otherwise
> against "c"'.  Thus the strings "ca" and "cha" would both match the
> expression
>
> [c \q{ch}]a
>
> while "chh" but not "ch" would match against
>
> [c \q{ch}]h
>

Right. We just independently discussed this today in the UTC meeting,
connected with the "properties of strings" discussion in the proposed
update.

[c \q{ch}]h should work like (ch|c)h. Note that the order of the
alternatives matters -- the alternation is equivalent to the set only if
longer strings are listed first.
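
As a rough illustration of why the order matters, here is a minimal sketch
using Python's re module (a backtracking engine, not ICU and not a UTS #18
implementation; the helper name first_cluster is made up for this example):

```python
import re

# In a backtracking engine such as Python's re, the leftmost alternative
# that leads to a match is preferred.
longest_first = re.compile(r"(ch|c)")
shortest_first = re.compile(r"(c|ch)")

def first_cluster(pattern, text):
    """Hypothetical helper: return what the alternation consumes at the start of text."""
    m = pattern.match(text)
    return m.group(0) if m else None

print(first_cluster(longest_first, "cha"))   # ch
print(first_cluster(shortest_first, "cha"))  # c

# With a following "h", the longest-first form accepts "chh".
print(bool(re.fullmatch(r"(ch|c)h", "chh")))  # True

# Caveat: Python's engine backtracks into the group, so after "ch" fails
# to be followed by "h" it retries with "c", and "ch" matches as well.
print(bool(re.fullmatch(r"(ch|c)h", "ch")))   # True
```

The last line shows that a plain alternation in a backtracking engine is
not atomic; the stricter reading quoted above, where "ch" does not match,
would additionally require the engine not to backtrack into the cluster.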

> May I correctly argue instead that matching against literal clusters
> would be satisfied by instead supporting, for this example, the regular
> subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"?
>

ICU UnicodeSet [c{ch}] is equivalent to UTS #18 [c\q{ch}].

ICU's UnicodeSet syntax is simpler; the UTS #18 syntax is more
backward-compatible.
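
For engines that support neither syntax, a set containing strings can be
approximated by a longest-first alternation. The following is only a sketch
in Python, assuming the set is given as a plain list of its elements; the
function name set_to_alternation is hypothetical and is not part of ICU or
UTS #18:

```python
import re

def set_to_alternation(elements):
    """Hypothetical helper: build a longest-first, non-capturing alternation
    from the elements (single characters or strings) of a set."""
    # Sort by descending length, then alphabetically for a stable result.
    ordered = sorted(set(elements), key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Models [c{ch}] (ICU) or [c\q{ch}] (UTS #18) as a plain alternation.
cluster = set_to_alternation(["c", "ch"])
print(cluster)  # (?:ch|c)

print(re.match(cluster + "a", "cha").group(0))  # cha
print(re.match(cluster + "a", "ca").group(0))   # ca
```

Sorting inside the helper means the caller does not have to remember to
list the longer strings first.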

Best regards,
markus