Pure Regular Expression Engines and Literal Clusters

Fri Oct 11 20:02:12 CDT 2019

On Fri, 11 Oct 2019 14:35:33 -0700
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> > > [c \q{ch}]h should work like (ch|c)h. Note that the order matters
> > > in the alternation -- so this works equivalently if longer
> > > strings are sorted first.  

> > Does conformance to UTS #18 level 2 mandate the choice of matching
> > substring? This would appear to prohibit compliance with POSIX rules,
> > where the length of the overall match counts.

> The idea is currently to specify properties-of-strings (and I think a
> range/class with "clusters") behaving like an alternation where the
> longest strings are first, and leaving it up to the regex engine
> exactly what that means.
> 
> In general, UTS #18 offers a lot of things that regex implementers
> may or may not adopt.

> If you have specific ideas, please send them as PRI feedback.
> (Discussion on the list is good and useful, but does not guarantee
> that it gets looked at when it counts.)

You claimed the order of alternatives mattered.  That is an important
issue for anyone rash enough to think that the standard is fit to be
used as a specification.
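
To make the point concrete, here is a minimal sketch in Python, whose `re` module takes the first alternative that leads to a match. Spelling out [c \q{ch}]h as an explicit alternation is my own illustration, and `leftmost_longest` is a hypothetical helper that only imitates the POSIX preference for the longest overall match at one starting position. It is not a POSIX engine.

```python
import re

text = "chh"

# Backtracking engine: the first alternative that leads to a match wins.
longest_first = re.match(r"(?:ch|c)h", text)
shortest_first = re.match(r"(?:c|ch)h", text)

print(longest_first.group(0))   # chh
print(shortest_first.group(0))  # ch


def leftmost_longest(alternatives, suffix, text):
    """Hypothetical helper: try every alternative at position 0 and keep
    the longest overall match, ignoring the order of the alternatives."""
    best = None
    for alt in alternatives:
        m = re.match(re.escape(alt) + suffix, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


print(leftmost_longest(["c", "ch"], "h", text))  # chh
print(leftmost_longest(["ch", "c"], "h", text))  # chh
```

Under the first-alternative rule the two orderings select different substrings of "chh". Under the longest-overall-match rule the order is irrelevant, which is why a mandate on ordering would conflict with POSIX behaviour.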

I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
can mean.  If the system uses NFD to simulate Unicode conformance,
shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
should it simply fail to match any NFD string?  I've been implementing
the view that all or none of the canonical equivalents of a string
match.  (I therefore support mildly discontiguous substrings, though I
don't support splitting undecomposable characters.)
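
A rough sketch of the two readings, using Python's `re` and `unicodedata` modules. The function `nfd_alternation` is a hypothetical illustration of rewriting the class as an alternation of NFD strings. It is not how my implementation works, and it does not handle the discontiguous case, as the last line shows.

```python
import re
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


# The class as written matches only the precomposed code points.
char_class = re.compile("[\u00c1\u00e1]")
precomposed = "\u00e1"
decomposed = nfd(precomposed)  # "a" followed by U+0301

print(char_class.search(precomposed) is not None)  # True
print(char_class.search(decomposed) is not None)   # False


def nfd_alternation(chars):
    """Hypothetical rewrite: replace a class by an alternation of the NFD
    forms of its members, longest strings first."""
    alts = sorted({nfd(c) for c in chars}, key=lambda s: (-len(s), s))
    return re.compile("(?:" + "|".join(re.escape(a) for a in alts) + ")")


rewritten = nfd_alternation("\u00c1\u00e1")
print(rewritten.fullmatch(decomposed) is not None)  # True

# U+00E1 followed by U+0323 (dot below) has the NFD form
# "a", U+0323, U+0301: the acute is no longer adjacent to the base.
with_dot_below = nfd("\u00e1\u0323")
print(rewritten.search(with_dot_below) is not None)  # False
```

The last case is the sort of mildly discontiguous substring I mean: the base letter and the acute are both present in the NFD text, but a mark of lower combining class sits between them, so a contiguous alternation cannot find them.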

Richard.