Regular Expressions and Canonical Equivalence
Richard Wordingham
richard.wordingham at ntlworld.com
Thu May 14 02:59:59 CDT 2015
On Thu, 14 May 2015 01:31:29 +0100
Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> I believe this corresponds to the
> intent of requirement RL2.1 that was in UTS#18 Unicode Regular
> Expression until the towel was thrown in and the paragraph survived
> but the requirement vanished.
I apologise if I am telling those interested what they already know. I
couldn't find it written down in terms of NFD strings.
I believe the core of the problem is that Thompson's construction
algorithm has to be significantly elaborated for concatenation. When
running the non-deterministic finite state machine for the regular
expression st, if the string is amnb with ccc(m) != ccc(n), one has to
consider the possibility that subsequence an matches expression s and
subsequence mb matches expression t. To handle a run of decomposed
characters with non-zero canonical combining class, one method
adds states of the form (x,y,n) where x is a state of for expression s,
y is a state for expression t, and n is the non-zero canonical combining
class of the last character received.
The additional problem with (algebraic) Kleene star is that for s* one
has to simultaneously consider s, ss, sss and so on, which makes the
state machine non-finite. This is probably just a formal problem; once
one adds capture groups to the FSM, the memory requirement depends on
the size of the string being examined. A solution is to effectively
add a loop to the parse structure of the regular expression and add
checks to the matching function to avoid unnecessary recursion.
An elegant formal solution to the Kleene star problem interprets
(\u0323\u0302)* as (\u0323|\u0302)*. However, that is
counter-intuitive, and simply rejecting such expressions would probably
be better. Going non-finite is probably better. My *finite* state
machine bodge for these cases is to simply match s+ to something
uncharacterised between s|ss and s+.
Richard.
More information about the Unicode
mailing list