Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Sat May 16 10:02:39 CDT 2015


On Sat, 16 May 2015 02:41:33 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> With a NFA, the representation is completely different, The regexp
> 
> (\u0302\u0302\0323)*(\u0302\0303)*\u0302
> 
> is just transformed into:
> 
> (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\
> 0303|˕\0303˕\u0302)*˕\u0302˕
>
> where I noted with the "tack" the 15 relative positions **in this new
> regex** where there's a need to check if the input matches a
> character or character class.

The old regex is a pattern for use with the trace monoid of Unicode
strings under canonical equivalence.

>From its appearance, I presume the new regex is intended for use with
strings, and that the third run of codepoints is meant
to be ˕\u0323˕\u0302˕\u0302 rather than a repeat of
˕\u0302˕\u0302˕\0323.

There is an annoying error.  You appear to assume that U+0302 COMBINING
CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't;
they have the same combining class, namely 230.  I'm going to assume
that 0303 is a typo for 0323.

\u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to
\u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the
corrected old regex (\u0302\u0302\u0323)*(\u0302\u0323)*\u0302.
However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
corrected new regex
(˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0323|˕\u0323˕\u0302)*˕\u0302˕

This example goes straight to the problem with the recommended way of
using string-based regular expression engines.  Using NFD throughout
works fine if one is working with whole words.  If fails if one is
working with sequences of combining marks and there is any complexity.

> But it's true that the allowed reorderings implied by canonical
> equivalences (and those that are NOT allowed because they are
> blocked) are really challenging !

They are not challenging at all.  Once you have eliminated the
precomposed characters and characters with singleton decompositions,
you are left with the trace monoid of Unicode strings under canonical
equivalence.  All you have to remember is that two characters commute if
and only if they have different positive canonical combining classes. 

Richard.



More information about the Unicode mailing list