Regular Expressions and Canonical Equivalence
Richard Wordingham
richard.wordingham at ntlworld.com
Fri May 15 16:57:03 CDT 2015
On Fri, 15 May 2015 22:09:13 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> This is because you don't understand the issue !
> > Now, a program to check whether a trace matching
> > {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just
> > counts the number of times \u0323 occurs and the number of times
> > \u0302 occurs, and returns whether they are equal.
> This is wrong. \0323\0323\0302\0302 and \0323\0302\0323\0302 would
> pass your counting test (which does not work in a FSA) but they are
> NOT canonically equivalent because the identical combining characters
> are blocking each other (so arbitrary ordering is not possible).
TUS7.0: D108 Reorderable pair:
Two adjacent characters A and B in a coded character sequence
<A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0303) is
a reorderable pair.
TUS7.0: D109 Canonical Ordering Algorithm:
In a decomposed character sequence D, exchange the positions of the
characters in each Reorderable Pair until the sequence contains no
more Reorderable Pairs.
The normalisation process on <U+0323, U+0302, U+0323, U+0302> first
replaces it by <U+0323, U+0323, U+0302, U+0302>. There are then no
more reorderable pairs, so that has reduced it to form NFD.
Therefore <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323,
U+0302> *are* canonically equivalent.
> So the FSA implementation is perfectly possible, for canonical
> equivalences only...
I now vaguely follow your argument, but it depends on the erroneous
claim that <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323,
U+0302> are not canonically equivalent.
> This evidently does not work if you are
> performing regexp searches using looser equivalences, such as
> compatibility equivalence.
I completely fail to understand this remark; it makes no difference
whether one uses canonical or compatibility equivalence.
Richard.
More information about the Unicode
mailing list