Regular Expressions and Canonical Equivalence
Richard Wordingham
richard.wordingham at ntlworld.com
Sat May 16 09:14:28 CDT 2015
On Sat, 16 May 2015 02:04:55 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> But do you agree that we still need to match pairs of distinct
> characters in your example ?
The original point I made was that (\u0323\u0302)*, as applied to
'traces' of Unicode strings under canonical equivalence, was only a
regular expression if one reinterpreted the *-operator.
The key points established in the theory of 'trace monoids' as applied
to fully decomposed Unicode strings are:
1) If a set ('language') A of Unicode strings under canonical
equivalence can be recognised by a *finite* state machine and, for each
string in A:
a) the string contain a starter, or
b) all characters in the string have the same canonical combining class
then there is a *finite* state machine that recognises A* with the
normal interpretation as the set of concatenations of zero or more
members of A.
2) Every set recognised by a *finite* state machine can be written in
the form of a regular expression using optionality, bracketing,
alternative, concatenation and Kleene star. Moreover, Kleene star will
only be applied to sets satisfying the condition above.
Moreover, the expression could be used to check the string as
converted to NFD.
That sounds like very good news until you remember that *searching* for
the canonical equivalent of U+00F4 in an NFD string needs something
like:
.*(o)[:^ccc = 230:]*(\u0302).*
This expression has two capture groups.
> But do you agree that we still need to match pairs of distinct
> characters in your example ?
A *finite* automaton acting on Unicode traces won't support
(\u0323\u0302)*. My preferred solution, if support is required, is to
bend the finite automaton to simultaneously consider an unbounded
number of repeats of a subexpression. This works for me because I
store states as strings and allocate each string from the heap. The
amount of memory required is sublinear in the length of the string
being searched.
> If you just count the otal it will be wrong with (\u0302\u0302\0323)*
> if you transform it into (\u0302|\u0302|\0323)* which is fully
> equivalent to (\u0302|\0323)*, because what you want is not matching
> pairs but triples (your counter check would have now to make sure
> there are two times more occurences of \u0302 and occurences of
> \u0323. If not, you need to rollback (by one or two characters,
> possibly more, until you satisfy the condition, but you won't know by
> just seeing the characters and advancing that your sequence is
> terminated: it is only at end that you have to do this check and only
> then you can rollback :
> The analysis cannot be deterministic, or it requires keeping a track
> of all acceptable positions previously seen that could satisfy the
> condition; as the sequence for (\u0302\u0302\0323)* can be extremely
> long, keeping this track for possible rollbacks coudl be costly. For
> example consider this regexp:
I don't do roll-backs. I use a non-deterministic finite automaton that
is equivalent to a deterministic finite automaton or, confronted with
this type of rational expression (it ain't regular for traces!), use a
non-deterministic slightly non-finite automaton. Now, capture groups
do destroy the finiteness of the automaton, and it looks like a matter
of trade-offs.
There is an example on the regular expression page in the ICU user
guide, searching AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC for (A+)+B.
Roll backs make this exponentially slow. My code runs through this
faster than it can display its progress, which is par for the course.
Now, my implementation of capture groups is far from complete. At
present, I capture all thousand or so possibilities, as I have no logic
to determine what is required. If I set it to capture all occurrences
of A+, I can just perceive an increase in run time when I pipe the
progress reporting to /dev/null. As I augment the recognition-related
state with the capture information, the number of active states is
quadratic with string length, and the logic to maintain the list of
states occupied is quadratic with the number of states, the
time to run varies as the fourth power of the string length.
> (\u0302\u0302\0323)*(\u0302\0303)*\u0302
>
> Can you still transform it and correctly infer the type of counters
> you need for the final check (before rollbacks) if you replace it
> with:
> <snip>
> But the counter check may still be wrong and you'll have to rollback
> through one or two occurences of \u0302 in order to find the location
> where the first iterated subregexp is satisfied. At this point the
> one ot two occurences of \u0302 that you've rolled back will be
> counted as being part of the 2nd iterated regexp, or even be the
> final occurence looked to match the end of the regexp.
> I don't see how you can support this regexp with a DFA, you
> absolutely need an NFA (and the counters you want to add do not offer
> any decisive help).
The reinterpretation of this expression as a regular expression
for traces substitutes 'concurrent iteration' for Kleene star. Each
trace in the bracketed expression that lacks a character with ccc=0 is
replaced by its maximal subtraces of each canonical combining class.
Under this scheme, (\u0302\u0302\0323)*(\u0302\0303)*\u0302 would be
interpreted as (\u0302\u0302|0323)*(\u0302\0303)*\u0302. As I said
before, that is unlikely to be what the user means by expressions like
these.
I'm not sure what you mean by 'NFA'. Do you mean a 'back-tracking
automaton'?
To support
(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302
I would use my extension of non-deterministic finite automaton to
process it. If you like, that is how I would do the counting - by the
number of incomplete matches to \u0323\u0302\u0302. Note that I take
characters from the input string in NFD order, along with their
(fractional) positions in the input string. I process
\u0302\u0302\u0323 as a very simple regex \u0323\u0302\u0302. The
state is simply the position of the next character, plus two states for
'all matched' and for 'cannot match'.
Running it with input \u0323\u0302\u0302\u0302, which it did recognise,
did show one problem. Me engine doesn't notice that when looking for
the first factor, \u0323\u0302\u0302, it is not possible for \u0302 to
belong to a subsequent factor. Instead it progresses to a
dead-end state where all subsequent input is assumed to be
part of another factor. Supporting Unicode properties may make fixing
this messy. I had been living with this because these dead-end states
are killed on receipt of a starter, and runs of non-starters are
normally not very long. No precomposed character decomposes to more
than three of them. I saw the need as being for something that runs
correctly, rather than for something that runs correctly and fast.
When checking whether a string matches, once I have fixed the problem
of dead-end states, there will be, for each state, one capturing group
for the last U+0323 encountered and, at most, one capturing group for
the last or penultimate U+0302 encountered. While the number of states
is unbounded, the number of possible states at any point is
uniformly bounded.
Searching for the pattern is a bit more complicated, as each U+0323
or U+0302 could be the last such character in a matching subtrace.
Richard.
More information about the Unicode
mailing list