Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Sat May 16 15:33:55 CDT 2015


On Sat, 16 May 2015 18:29:18 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-16 17:02 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > There is an annoying error.  You appear to assume that U+0302
> > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but
> > they don't; they have the same combining class, namely 230.  I'm
> > going to assume that 0303 is a typo for 0323.
> 
> 
> Not a typo, and I did not made the assumption you suppose because I
> chose then so that they were effectively using the **same** combining
> class, so that they do not commute.

In that case you have an even worse problem.  Neither the trace nor the
string \u0303\u0302\u0302 matches the pattern
(\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the
regular expression
(˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\
0303|˕\0303˕\u0302)*˕\u0302˕

You've transformed (\u0302\u0303) into (˕\u0302˕\0303|˕\0303˕\u0302),
but that is unnecessary and wrong, because U+0302 and U+0303 do not
commute. 

> It was the key fact of my argument that destroys your argumentation.

Which argument?  Restoring the \u303, the fact that remains that

\u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to
\u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the
corrected old regex (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302.
However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
corrected new regex
(˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303|˕\u0303˕\u0302)*˕\u0302˕

Do you claim that this argument is destroyed?  If it is irrelevant, why
is it irrelevant?  It shows that your transform does not solve the
original problem of missed matches.

> Reread carefully and use the example string I gave and don't assume I
> wanted to write u0323 instead of u0303.

I'm not at all sure what your example string is.  I ran my program to
watch its progression with input \u0323\u0323\u0302\u0302, which does
not match the pattern, and attach the outputs for your scorn.  I have
added comments started by #.

# NDE = new dead end - I could tweak the program so this state is not
entered.

# NDE! = new dead end that might not be easy to avoid.

# ODE = old dead end - derived from a state already labelled ODE or NDE.

# ODE! = old dead end - derived from a state already labelled ODE! or
NDE!.

Here are the run outputs, with blank lines added to assist formatting.

$ ./regex -b
'(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302' '\u0323\u0302\u0323\u0302'
# ignore line wrapping above.

Examining /home/richard/unicode/UCD/7.00/PropertyAliases.txt.
Examining /home/richard/unicode/UCD/7.00/PropertyValueAliases.txt.
Examining /home/richard/unicode/UCD/7.00/SpecialCasing.txt.
Examining /home/richard/unicode/UCD/7.00/Scripts.txt.
Examining /home/richard/unicode/UCD/7.00/PropList.txt.
Simple Unicode regex "\u0323\u0302\u0302"
Simple ASCII regex ""                # I construct A* = (|A+)
Simple Unicode regex "\u0302\u0303"
Simple Unicode regex "\u0302"
Initial states:
 0) LLLL0
# The states are named according to a hierarchy of regexes.
# LL = regex (\u0302\u0302\u0323)*
# LLL = regex (\u0302\u0302\u0323)+
# LLLL = regex \u0302\u0302\u0323.
# This is implemented as 'Simple Unicode regex "\u0323\u0302\u0302"'.
# 0 means about to compare with character at offset 0, i.e. 0
 1) LLRM
# LLR = Empty string regex.
# M = matched
 2) LRLL0
# LR = regex (\u0302\u0303)*
# LRL = regex (\u0302\u0303)+
# LRLL = regex \u0302\u0303
 3) LRRM
# LRR = Empty string regex.
 4) R0
# R = regex \u0302
=0323=00:06:= # Get U+0323 from whole (=0) at byte 0 of argument
LLLL0 => LLLL2
LLLL0 => LLLN001220:0:L2  # NDE!

=0323=012:018:= # Note that string is input in NFD order.
LLLL2 => LLLN001220:2:L2
# Now running LLLL and LLLR, whose states have relative names 2 and L2.
# LLLR is a clone of LLL.
# This recursion enables the recognition of unrecognisable Kleene
# stars.  It makes the automaton non-finite.
# 001 is length in hex of name of relative state of LLLL
# 220 means non-starters of ccc <= 220 will not be fed to LLLL
LLLN001220:0:L2 => LLLN001220:0:N001220:2:L2 # ODE!
=0302=06:012:=
LLLN001220:2:L2 => LLLN001220:4:L2
LLLN001220:2:L2 => LLLN001230:2:L4                      # NDE
LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LL2           # NDE
# L = regex (\u0302\u0302\u0323)*(\u0302\u0303)*        # NDE
LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LN001230:0:L2 # NDE
LLLN001220:2:L2 => N00E230:LLN001220:2:L2:M             # NDE
LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001220:4:L2  # ODE!
LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001230:2:L4  # ODE!
LLLN001220:0:N001220:2:L2 => LN017230:LN001220:0:N001220:2:L2:LL2 # ODE!
LLLN001220:0:N001220:2:L2 => # Line-break is email artefact.
LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 # ODE!
LLLN001220:0:N001220:2:L2 => N018230:LLN001220:0:N001220:2:L2:M # ODE!

=0302=018:024:=
LLLN001220:4:L2 => LLLN001220:M:L2 # Redundant - should purge somehow.
LLLN001220:4:L2 => LLLL2
# Regex LLLL 'recognised' - rename LLLRL as LLLL.
LLLN001220:4:L2 => LLLN001230:4:L4                       # NDE
LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LL2            # NDE
LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LN001230:0:L2  # NDE
LLLN001220:4:L2 => N00E230:LLN001220:4:L2:M              # NDE
LLLN001230:2:L4 => LLLN001230:2:LM                       # ODE
LLLN001230:2:L4 => LLLN001230:2:L0                       # ODE
LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LL2            # ODE
LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LN001230:0:L2  # ODE
LLLN001230:2:L4 => N00E230:LLN001230:2:L4:M              # ODE
LN00D230:LN001220:2:L2:LL2 => LN00D230:LN001220:2:L2:LN001230:2:L2 # ODE
LN00D230:LN001220:2:L2:LL2 => N019230:N00D230:LN001220:2:L2:LL2:M  # ODE
LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is e-mail artefact
LN00D230:LN001220:2:L2:LN001230:0:N001230:2:L2           # ODE
LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is email artefact
N023230:N00D230:LN001220:2:L2:LN001230:0:L2:M            # ODE
LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001220:M:L2   # ODE!
LLLN001230:0:N001220:4:L2 => LLLN001230:0:L2             # ODE!
LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001230:4:L4   # ODE!
LLLN001230:0:N001220:4:L2 => LN017230:LN001230:0:N001220:4:L2:LL2 # ODE!
LLLN001230:0:N001220:4:L2 => # Line-break is e-mail artefact.
LN017230:LN001230:0:N001220:4:L2:LN001230:0:L2           # ODE!
LLLN001230:0:N001220:4:L2 => N018230:LLN001230:0:N001220:4:L2:M # ODE!
LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:LM   # ODE!
LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:L0   # ODE!
LLLN001230:0:N001230:2:L4 => LN017230:LN001230:0:N001230:2:L4:LL2 # ODE!
LLLN001230:0:N001230:2:L4 => # Line-break is e-mail artefact
LN017230:LN001230:0:N001230:2:L4:LN001230:0:L2           # ODE!
LLLN001230:0:N001230:2:L4 => N018230:LLN001230:0:N001230:2:L4:M # ODE!
LN017230:LN001220:0:N001220:2:L2:LL2 =>
LN017230:LN001220:0:N001220:2:L2:LN001230:2:L2           # ODE!
LN017230:LN001220:0:N001220:2:L2:LL2 =>
N023230:N017230:LN001220:0:N001220:2:L2:LL2:M            # ODE!
LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 =>
LN017230:LN001220:0:N001220:2:L2:LN001230:0:N001230:2:L2 # ODE!
LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 =>
N02D230:N017230:LN001220:0:N001220:2:L2:LN001230:0:L2:M  # ODE!
End marker is at 024:OVF

> And you'll see that backtracing is necessary for this case (EVEN if
> you don't care about capture groups but you are only interested in
> the global capture $0).

What I see is the desirability of some optimisation, but no problem in
principle.  Now I might see something different with your intended
example - but until I see it I think my examination would be
overwhelmed by dead-end state propagations.

If you are making the point that a backtracking automaton might need
to backtrack, then I won't dispute that point.  

Richard.



More information about the Unicode mailing list