Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Sun May 17 11:52:56 CDT 2015


On Sun, 17 May 2015 16:45:18 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-16 22:33 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > I'm not at all sure what your example string is.  I ran my program
> > to watch its progression with input \u0323\u0323\u0302\u0302, which
> > does not match the pattern, and attach the outputs for your scorn.
> > I have added comments started by #.
> >
> 
> Sorry for not commenting it, this is the internal tricks and outputs
> of your program, and your added comments does not allow me to
> interpret what all this means, i.e. the exact role of the notations
> with sequences or "L" or "R" or "N", and what the "=>" notation means
> (I suppose this is noting an advance rule and that the left-hand side
> is the state before, the right-hand-side is the state after, but I
> don't see where is the condition (the character or character class to
> match, or an error condition). You've only "explained" partly the NDE
> and ODE comments and the "!" when it is appended.

'ODE' and 'NDE' mean the transitions should not occur when I finish my
current set of edits.  The exclamation mark means the optimisation I
first though of wouldn't eliminate it.

> Is that really what your regexp engine outputs as its internally
> generated parser tables (only "friendly" serialized as a "readable"
> text) ?

When running the regex, I really do hold the states in forms like LLLL2
and LLLN001220:2:L2.  (The colons are unnecessary; I included them for
readability.)  It's designed for proof of principle, rather than high
speed.

There is also a tree corresponding to the analysis of the
regex; the nodes record how the lower level regexes are combined.  The
branching nodes in the example are for sequences.  In the simplest
case, a matching expression will, in some canonically equivalent form,
be the concatenation of a string matching the left hand node and a
string matching the right mode.  For iterations ('*' and '+', though I
treat '+' as basic), the tree does not need a corresponding right
branch, as all the information about the regex is held in the left
branch.  An 'L' means that the input sequence is proceeding
through the left branch.  An 'R' means that it has completed its
passage through the left branch, and is now proceeding through the
right branch.  All this would be applicable if I were ignoring
canonical equivalence.

An 'N' (for 'normalisation') means that parsing is passing through the
region where the normalisation has interleaved the left and right hand
component strings.  As I consider each fresh character, I have to
consider its canonical combining class.  The string for the state
records what ccc is blocked from the left hand string.  As I take the
characters from the input string in NFD order, I only need to remember
the highest blocked ccc.  The first character I receive with a lower
ccc will be a starter, at which point I will only be progressing the
right hand component string.  For the state in the parent regex, I
record the 'N' (as opposed to an 'L' or 'R'), the highest blocked ccc,
the state in the left-hand regex and the state in the right-hand regex.

The input characters are recorded in the form

=<character scalar value>=<this character start location>:<next
character start location>:=

The character location is recorded in the from <part><byte offset>.

The part is a single digit.  0 means whole character, 1 means first
character in character decomposing to multiple characters, 2 means
second and so on.

Thus, as the first U+0302 is stored as 6-character escape code
'\u0302', I record the position as:

=0302=06:012:=

I then record the consequential transition from each state to another
state.  As the basic structure is that of a non-deterministic finite
automaton (as at
https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton ),
there may be no or many transitions from a particular state.  There are
no error conditions as such.  As I record each transition, I record
whether there is now a match to the whole regex and whether the state
is a duplicate.  Detecting duplicates is part of the key to the
classical NFA's better resistance to 'pathological inputs' compared to
back-tracking algorithms.

There are two main state numberings for the bottom level regexes.  The
main bottom level regex is a simple regex with no alternates or
groupings.  The engine propagates the simple regex as a string and
records the state as the byte offset of the next character to compare
against.  The regex is stored in Latin-1 or UTF-8.  (Latin-1 is not
suitable for precomposed characters.)  Thus when the first character
input is U+0323 and is compared against the regex \u0323\u0302\u0302,
the state for the regex changes from 0 to 2, as U+0323 occupies 2 bytes
in UTF-8.  This is recorded as an overall state transition 'LLLL0 =>
LLLL2'.  When all characters in the string have been matched, the state
becomes 'M'.  The simple regexes have one-to-many state progressions to
handle iterations and optionality ('*', '+' and '?').

The second system is for Unicode properties.  The state records the
composition of precomposed characters by using the accumulated
codepoint as the state.  However, the state also includes a success
flag for ease of composing the acceptance or otherwise of the overall
state and to determine transitions from one regex to the next.

My program does not calculate what the characters are for a
state transition to occur.  Instead, it calculates what transitions
occur in response to an input character.

Richard.


More information about the Unicode mailing list