Regular Expressions and Canonical Equivalence

Philippe Verdy verdy_p at wanadoo.fr
Thu May 14 19:10:36 CDT 2015


2015-05-14 20:13 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Thu, 14 May 2015 12:58:29 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > 2015-05-14 9:59 GMT+02:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
> >
> > > An elegant formal solution to the Kleene star problem interprets
> > > (\u0323\u0302)* as (\u0323|\u0302)*.  However, that is
> > > counter-intuitive
>
> The technical term for this is the 'concurrent iteration' - or at
> least, that's the term used in the 'Book of Traces'.
>
> > For your example "(\u0323\u0302)*" the characters in the alternatives
> > (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to
> > NFD (which is the same here) are just using at most two distinct
> > non-zero combining classes and no blocker; so it is safe to trasform
> > it to (\u0323|\u0302)* for a first pass matching that will then only
> > check candidate matchs in the second pass. or more efficiently, a
> > second finite state automata (FSA) running in parallel with its own
> > state:
>
> You've forgotten the basic problem.  A *finite* state automaton cannot
> count very far; with only n states, it cannot count as far as n.
>

I did not forget it, this is why there's a second pass (or a second FSA
running in parallel to indicate its own accept state). You have to combine
the two states variables to get the final combined state to determine if it
is a final accept state.

But one of the two state variable has an upper bound which is not only
finite but very small (it has at most 255 possible values).

Typical regexp engines do not create the full deterministic automata with
all its states (it would require a lot of memory due to combinatorial
effects, they handle multiple state variables in parallel and use a
rendez-vous system to test them in order to determine if we have an accept
state, or a fail state (for which we must rollback).

So even if one of the state is not bound in terms of length, the other one
(exploring the possible lengths of reorderable of non-blocking combining
characters) is clearly finite (so you don't need to count very far.

So you just need a single additional byte of storage for storing the second
state variable in the global state of your FSA. The size of the global
state variable only depends on the number of alternatives in your regexp
and it is also bound (limited to the length of the source regexp: even if
if your regexp speicification string is 1000 characters long, you know that
you will never need more than 1000 bytes to represent it, but of course it
will not be a simple 32-bit integer: this structure can represent billions
of billions of billions of possible states without needing to transform the
FSA to a pure deterministic FSA with a single integer and without having to
build a single MxN transition matrix (with M columns for each possible
character class, and N rows for each each deterministic state, where each
cell contains the value of for next deterministic state: this will not work)

Even if your regexp is so complex that it requires a specification string
that is 100KB long, your global state variable will never be longer than
100KB.

But of course, this structure is a bit less easier to use when advancing:
you have to advance all active states in parallel using the current input
character with each transition submatrix (which is really small as well
with just a couple of elements that can fit in a small structure with fixed
size: an accept character, limited to 21 bits with Unicode, or a character
class index, and a few flags, 3 bits, for saying if this character is
advancable, or if the current state is an accept state or a failure state,
or to indicate the presence of an alternative and give an index to its own
branch by specifying only which elementary state variable is used for that
alternative within the structure of the global state variable).

In summary the global state varaible is just a small array of 32 bit
integers for the most complex regexps you will encounter (I don't think
that 100KB regexps are very common, almost of them are below 1KB, so your
global state variable will fit in 4KB of memory, and the transition matrix
will also fit in 4KB).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20150515/a458cadf/attachment.html>


More information about the Unicode mailing list