Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Sat Oct 12 07:17:55 CDT 2019


On Fri, 11 Oct 2019 18:37:18 -0700
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> >
> > You claimed the order of alternatives mattered.  That is an
> > important issue for anyone rash enough to think that the standard
> > is fit to be used as a specification.
> >  
> 
> Regex engines differ in how they handle the interpretation of the
> matching of alternatives, and it is not possible for us to wave a
> magic wand to change them.

But you are close to waving a truncheon to deprecate some of them.  And
even if you do not wave the truncheon, you will provide other people a
stick to beat them with.

> What we can do is specify how the interpretation of the properties of
> strings works. By specifying that they behave like alternation AND
> adding the extra constraint of having longer first, we minimize the
> differences across regex engines.

But remember that 'having longer first' is meaningless for a
non-deterministic finite automaton that does a single pass through the
string to be searched.

> > I'm still not entirely clear what a regular
> > expression /[\u00c1\u00e1]/ can mean.  If the system uses NFD to
> > simulate Unicode conformance, shall the expression then be
> > converted to /[{A\u0301}{a\u0301}]/?  Or should it simply fail to
> > match any NFD string?  I've been implementing the view that all or
> > none of the canonical equivalents of a string match.  (I therefore
> > support mildly discontiguous substrings, though I don't support
> > splitting undecomposable characters.) 
> 
> We came to the conclusion years ago that regex engines cannot
> reasonably be expected to implement canonical equivalence; they are
> really working at a lower level.

So does a lot of text processing.  The issue should simply be that the
change is too complicated for straightforward implementation:

(1) One winds up with slightly discontiguous substrings: the
non-starters at the beginning and end may not be contiguous.

(2) If one does not work with NFD, one ends up with parts of characters
in substrings.

(3) If one does not work with NFD (thereby formally avoiding the
issue of Unicode equivalence), replacing a non-starter by a character
of a different ccc is in general not a Unicode-compliant process.  (This
avoidance technique can be necessary for the Unicode Collation
Algorithm.)

(4) The algorithm for recognising concatenation and iteration (more
precisely, their closures under canonical equivalence) need to be
significantly rewritten.  One needs to be careful with optimisation -
some approaches could lead to reducing an FSM with over 2^54 states.

The issue of concatenation and iteration is largely solved in the
theory of traces and regular expressions, though there is still the
issue of when the iteration (Kleene star) of a regular expression (for
traces) is itself regular.  In the literature, this issue is called the
'star problem'.  One practical answer is that the Kleene star is itself
regular if it is generated from the set of strings matching the regular
expression that either contain NFD non-starters or all of whose
characters have the same ccc.  An unchecked requirement that
Kleene stars all be of this form would probably not be too great
a problem - one could probably dress this up by 'only fully supporting
Kleene star that is the same as the "concurrent star"'. Another one is
that recognition algorithms do not need to restrict themselves to
*regular* expressions - back references are not 'regular' either.

/\u0F73*/ is probably the simplest example of a non-regular Kleene star
in the Unicode strings under canonical equivalence.  (That character is
a problem character for ICU collation.)
However, /[[:Tibetan:]&[:insc=vowel_dependent:]]*/ is regular, as
removing U+0F73 from the Unicode set does not change its iteration.
Contrariwise, there might be a formal issue with giving <U+0F73 TIBETAN
VOWEL SIGN II, U+0F73, U+0F73> preference over <U+0F7A TIBETAN
VOWEL SIGN E, U+0F7A, U+0F7A> if one used the iteration algorithm for
regular-only Kleene star.

> So you see the advice we give at
> http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no
> magic wand.)

So who's got tools for converting the USE's expression for a
'standard cluster' into a regular expression that catches all NFD
equivalents of the original expression?  There may be perfection issues
- the star problem may be unsolved for sequences of Unicode strings
under canonical equivalence.  Annoyingly, I can't find any text but my
own that relates traces to Unicode!

The trick of converting strings to NFD before searching them is
certainly useful.  Even with an engine respecting canonical
equivalence, it cuts the 2^54 I mentioned down to 54, the
number of non-zero canonical combining classes currently in use.  Of
course, such a reduction is not fully consistent with the spirit of a
finite state machine.

Richard.




More information about the Unicode mailing list