Regular Expressions and Canonical Equivalence

Philippe Verdy verdy_p at wanadoo.fr
Fri May 15 17:31:53 CDT 2015


2015-05-15 23:57 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Fri, 15 May 2015 22:09:13 +0200
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > 2015-05-15 9:10 GMT+02:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
>
> > This is because you don't understand the issue !
>
> > > Now, a program to check whether a trace matching
> > > (\u0323|\u0302)* matches (\u0323\u0302)* is very simple.  It just
> > > counts the number of times \u0323 occurs and the number of times
> > > \u0302 occurs, and returns whether they are equal.
>
> > This is wrong. \u0323\u0323\u0302\u0302 and \u0323\u0302\u0323\u0302 would
> > pass your counting test (which does not work in a FSA) but they are
> > NOT canonically equivalent because the identical combining characters
> > are blocking each other (so arbitrary ordering is not possible).
>
> TUS7.0: D108   Reorderable pair:
>  Two adjacent characters A and B in a coded character sequence
>  <A, B> are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.
>
> Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0323) is
> a reorderable pair.
>

I do NOT contest that U+0323 and U+0302 can reorder. My point is that
U+0323 blocks another occurrence of U+0323 because it has the **same**
combining class.
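
For anyone who wants to test the statements in this thread on concrete
strings, here is a minimal sketch in Python. It assumes only the standard
`unicodedata` module; the helper names (`ccc`, `is_reorderable_pair`,
`canonically_equivalent`) are made up for illustration and do not come from
any library. It looks up combining classes, applies the D108 test quoted
above, and compares two sequences by their NFD forms. The results depend on
the Unicode data shipped with the Python build.

```python
import unicodedata


def ccc(ch: str) -> int:
    """Return the canonical combining class of a single character."""
    return unicodedata.combining(ch)


def is_reorderable_pair(a: str, b: str) -> bool:
    """D108: <A, B> is a reorderable pair iff ccc(A) > ccc(B) > 0."""
    return ccc(a) > ccc(b) > 0


def canonically_equivalent(s: str, t: str) -> bool:
    """Compare two strings by their NFD forms (full canonical decomposition
    followed by canonical ordering)."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


if __name__ == "__main__":
    dot_below, circumflex = "\u0323", "\u0302"

    # Combining classes of the two marks discussed in the thread.
    print(ccc(dot_below), ccc(circumflex))

    # D108 applied to a mixed pair and to a pair of identical marks.
    print(is_reorderable_pair(circumflex, dot_below))
    print(is_reorderable_pair(dot_below, dot_below))

    # The two sequences from the quoted message.
    print(canonically_equivalent("\u0323\u0323\u0302\u0302",
                                 "\u0323\u0302\u0323\u0302"))
```

Running the script prints the combining classes and the outcome of each
check, so any pair of mark sequences can be compared the same way.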