Pure Regular Expression Engines and Literal Clusters

Mark Davis ☕️ via Unicode unicode at unicode.org
Fri Oct 11 20:37:18 CDT 2019

> You claimed the order of alternatives mattered.  That is an important
> issue for anyone rash enough to think that the standard is fit to be
> used as a specification.

Regex engines differ in how they match alternatives, and it is not possible
for us to wave a magic wand to change them.

What we can do is specify how properties of strings are interpreted. By
specifying that they behave like alternation AND adding the extra
constraint that longer strings come first, we minimize the differences
across regex engines.
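
To illustrate the point about ordering, here is a minimal sketch using
Python's re module, which is a backtracking engine that takes the first
alternative that succeeds. The helper name longest_first_alternation and
the sample strings are assumptions made for the example; they are not taken
from UTS #18.

```python
import re

def longest_first_alternation(strings):
    """Build an alternation with longer strings listed before shorter ones.

    Hypothetical helper for illustration only.
    """
    # Sort by descending length; break ties alphabetically so the
    # resulting pattern is deterministic.
    ordered = sorted(set(strings), key=lambda s: (-len(s), s))
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Shortest-first order: a leftmost-first engine stops at the first hit.
naive = re.compile("(?:a|ab|abc)")

# Longest-first order: the same engine now finds the longest string.
ordered = re.compile(longest_first_alternation(["a", "ab", "abc"]))

naive_match = naive.match("abcd").group(0)
ordered_match = ordered.match("abcd").group(0)

print(repr(naive_match))    # 'a'
print(repr(ordered_match))  # 'abc'
```

An engine with leftmost-longest semantics would return 'abc' for both
patterns, so listing longer strings first is what brings the two kinds of
engine into agreement.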

> I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
> can mean.  If the system uses NFD to simulate Unicode conformance,
> shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
> should it simply fail to match any NFD string?  I've been implementing
> the view that all or none of the canonical equivalents of a string
> match.  (I therefore support mildly discontiguous substrings, though I
> don't support splitting undecomposable characters.)

We came to the conclusion years ago that regex engines cannot reasonably be
expected to implement canonical equivalence; they are really working at a
lower level. So you see the advice we give at
http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no magic
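
As a rough sketch of what working at a lower level means in practice,
the following uses Python's re and unicodedata modules. It assumes the
caller has put the text into NFD and then writes the pattern against that
form by hand; the sample word and variable names are made up for the
example.

```python
import re
import unicodedata

# The text is put into a defined normalization form (NFD here), so
# U+00C1 becomes "A" followed by U+0301 COMBINING ACUTE ACCENT.
text_nfd = unicodedata.normalize("NFD", "\u00c1rbol")

# A class of precomposed characters works on code points, so it does
# not see the decomposed sequence.
class_pattern = re.compile("[\u00c1\u00e1]")

# A pattern written for NFD text spells out the decomposed sequences
# as alternatives.
nfd_pattern = re.compile("(?:A\u0301|a\u0301)")

class_hit = class_pattern.search(text_nfd)
nfd_hit = nfd_pattern.search(text_nfd)

print(class_hit)                       # None
print(nfd_hit.group(0) == "A\u0301")   # True
```

The engine itself does nothing about canonical equivalence here; the match
succeeds only because the text and the pattern were both prepared for the
same normalization form.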

> Richard.