Regular Expressions and Canonical Equivalence

Sun May 17 19:03:02 CDT 2015

On Sun, 17 May 2015 16:33:15 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-16 22:33 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > On Sat, 16 May 2015 18:29:18 +0200
> > Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> >
> > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham <
> > > richard.wordingham at ntlworld.com>:
> > >
> > > > There is an annoying error.  You appear to assume that U+0302
> > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute,
> > > > but they don't; they have the same combining class, namely
> > > > 230.  I'm going to assume that 0303 is a typo for 0323.
> > >
> > >
> > > Not a typo, and I did not made the assumption you suppose because
> > > I chose then so that they were effectively using the **same**
> > > combining class, so that they do not commute.
> >
> > In that case you have an even worse problem.  Neither the trace nor
> > the string \u0303\u0302\u0302 matches the pattern
> > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match
> > the regular expression
> > (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\
> > 0303|˕\0303˕\u0302)*˕\u0302˕
> >
> > You've transformed (\u0302\u0303) into
> > (˕\u0302˕\0303|˕\0303˕\u0302), but that is unnecessary and wrong,
> > because U+0302 and U+0303 do not commute.
> 
> 
> Oh right! Thanks for pointing, it was intended you can read it as.
> 
> (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\0303)*˕\u0302˕
> 
> But my argument remains because of the presence of \0302 in the second
> subregexp (which additionally is a separate capture, but here I'm not
> concentrating on the impact in numbered captures, but only on the
> global capture aka $0)
> 
> 
> > > It was the key fact of my argument that destroys your
> > > argumentation.
> >
> > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
> > corrected new regex
> >
> > (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕
> >
> > Do you claim that this argument is destroyed?  If it is irrelevant,
> > why is it irrelevant?  It shows that your transform does not solve
> > the original problem of missed matches.
> >
> 
> Why doesn't it solve it? 

Sorry, my example wasn't quite right.  It should have two combining
dots below and five circumflexes, not four as I wrote it.  I will
first explain how my NDnear-FA handles it - I have now removed the
generation of the dead end states.

Initial states:
 0) LLLL0 # Starting the \u0302\u0302\u0323 factor,
          # implemented as \u0323\u032\u0320
 1) LLRM  # Completed the zero trip alternative to (\u0302\u0302\u0323)+
          # Not actually useful.
 2) LRLL0 # Starting the \u0302\u0303 factor
 3) LRRM  # Completed the zero trip alternative to (\u0302\u0303)+
 4) R0    # Starting the \u0302 factor

=0323=00:06:=
LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323

=0323=06:012:=
LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor.
               # Both have progressed as far as \u0323.
               # Finiteness would restrict me to, say, 3 repeats
               # in progress.
# The states of the finite DFA are a cross product of 3 copies of
# the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant
# ccc values.  By no means all of these states are used.

# In the Kleene stars of the regular expression guaranteed by
# recognisability, 3 copies caters for the worst case, xyz, where x
# has a starter and ends in a non-starter, y consists of non-starters
# with the same canonical combining class, and z starts with
# non-starter and contains a starter, e.g.  

# x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42

# to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42

=0302=012:018:=
LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors
               # First has progressed to \u0323\u0302 and second to
               # \u0323.  The other way round has been pruned by the
               # automated observation that if \u0302 is blocked from
               # first factor, the factor cannot be completed. 

=0302=018:024:=
LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor
LLLN001220:4:L2 => LLLL2  # As first factor is complete, remove it from
                          # consideration and relabel second factor as
                          # first.

=0302=024:030:=
LLLL2 => LLLL4  # \u0323\0302\u0302 completed as far as \u0323\u0302

=0302=030:036:=
LLLL4 => LLLLM  # \u0323\u0302\u0302 is complete.
LLLL4 => LRLL0  # So start \u0302\u0303 factor.
LLLL4 => LRRM   # Alternatively, completed the zero trip option of
                # (\u0302\u0303)*
LLLL4 => R0     # Or, we have progressed as far as the final \u0302
LLLL4 => LLLL0  # Or, start another \u0323\u0302\u0302

=0302=036:042:=
LRLL0 => LRLL2   # Got as far as \u0302 in \u0302\u0303
R0 => RM (match) # Or completed the final \u0302.
End marker is at 042:OVF

Could you please talk me through how your system recognises the string
\u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex.  I
can't work out how it is supposed to work from your description.  

Richard.