Regexes, Canonical Equivalence and Backtracking of Input
Richard Wordingham
richard.wordingham at ntlworld.com
Mon May 18 19:44:17 CDT 2015
On Tue, 19 May 2015 01:25:54 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> I don't work with strings, but with what you seem to call "traces",
For the concept of traces, Wikipedia suffices:
https://fr.wikipedia.org/wiki/Mono%C3%AFde_des_traces .
As far as text manipulation is concerned, the word 'trace' is an
idealisation of how Latin text is written. Base letters advance the
writing point, so they commute with nothing (canonical combining class
0). Ideally, marks of different canonical combining classes do not
interact with one another when writing, so they commute. In general,
marks of the same canonical combining class interact with one another,
be it only to move the subsequent one further from the base letter, so
they do not commute.
The traces I refer to are the equivalence classes of Unicode modulo
canonical equivalence. To apply the theory, I have to regard
decomposable characters as notations for sequences of 1 to 4
indecomposable characters. The notion also works with compatibility
equivalence, and one could use a stronger notion of equivalence, so
that compatibility ideographs would not have singleton decompositions.
Thus, as strings, \u0323\u0302 and \u0302\u0323 are distinct, but as
traces, they are identical.
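This is easy to check with any stock NFD implementation; for
instance, in Python (a quick illustration, not part of my matcher):

    import unicodedata

    # U+0323 (dot below) has ccc 220 and U+0302 (circumflex) has ccc
    # 230: different classes, so the two orders are canonically
    # equivalent, and NFD picks the same representative for both.
    assert unicodedata.normalize('NFD', '\u0323\u0302') == \
           unicodedata.normalize('NFD', '\u0302\u0323')

    # U+0300 and U+0301 both have ccc 230: same class, so they do
    # not commute, and the two orders remain distinct traces.
    assert unicodedata.normalize('NFD', '\u0300\u0301') != \
           unicodedata.normalize('NFD', '\u0301\u0300')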
The lexicographic normal form that is most useful is simply NFD. The
indecomposable characters are ordered by canonical combining class;
beyond that the order does not matter, so one may as well break ties
by codepoint.
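Python's unicodedata shows the ordering directly (again purely
illustrative):

    import unicodedata

    s = 'q\u0307\u0323'  # q, dot above (ccc 230), dot below (ccc 220)
    nfd = unicodedata.normalize('NFD', s)
    print([(hex(ord(c)), unicodedata.combining(c)) for c in nfd])
    # [('0x71', 0), ('0x323', 220), ('0x307', 230)]
    # Within each combining sequence, NFD stably sorts the marks by
    # canonical combining class, which is exactly a lexicographic
    # normal form for the trace.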
> but that I call sets of states (they are in fact bitsets, which may be
> compacted or just stored as arrays of bytes containing just 1 useful
> bit, which may be a bit faster; byte arrays are just simpler to
> program), in a stack (I'll use bitsets later to make the structure
> more compact, if needed, but for now this is fast enough and not
> memory intensive even for large regexps with many repetitions with
> "+/*/{m,n}" or variable parts).
Your 'bitset' sounds like a general-purpose type, and like an
implementation detail that surfaces in your discussion.
> The internal matcher uses NFD, but
> needs to track the positions in the original buffered input for
> returning captured matches.
That's how I'm working. I do not regard decomposable characters as
atomic; I am quite happy working with fractions of characters.
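For illustration only - this is a hypothetical sketch in Python, not
how either of our matchers is actually coded - one can decompose
character by character, recording where each NFD code point came
from, and then apply canonical reordering while carrying the offsets
along:

    import unicodedata

    def nfd_with_offsets(s):
        # Pair each NFD code point with the index of the original
        # character it came from (canonical decomposition is per
        # character, so this step is safe).
        pairs = []
        for i, ch in enumerate(s):
            for d in unicodedata.normalize('NFD', ch):
                pairs.append((d, i))
        # Canonical ordering: stable exchange sort of adjacent marks
        # by combining class, keeping each mark's offset with it.
        # Starters (ccc 0) act as barriers and are never moved.
        changed = True
        while changed:
            changed = False
            for j in range(len(pairs) - 1):
                a = unicodedata.combining(pairs[j][0])
                b = unicodedata.combining(pairs[j + 1][0])
                if a > b > 0:
                    pairs[j], pairs[j + 1] = pairs[j + 1], pairs[j]
                    changed = True
        return ''.join(p[0] for p in pairs), [p[1] for p in pairs]

A match found at NFD positions [m, n) can then be reported against
the original buffer via the offset list.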
> ... Greek, Cyrillic and Arabic, but also too few for Hebrew where
> "pathological" cases of regexps are certainly more likely to occur
> than in Latin, even with Vietnamese and its frequent double
> diacritics).
I was just thinking that respecting canonical equivalence might be
very useful for Hebrew, particularly when dealing with text with
accents.
> Finally a question:
>
> I suppose that like many programmers you have read the famous "green
> dragon" book by Sethi/Aho/Ullman about compilers. I can
> understand the terminology they use when speaking about automata
> (and that is found in many other places), but apparently you are
> using some terms that I have to guess from their context.
No, I started off by hunting the web to try and work out what was
special about a regular expression, and found the articles in
Wikipedia quite helpful. When working out how to make matching
respect canonical equivalence, I started out by normalising strings
to NFD. Only after I had generalised the closure properties of
regular languages from strings to these representative forms (with the
exception of Kleene star) did I finally discover what I had long
suspected, that I was not the first person to investigate regular
expressions on non-free monoids. What does surprise me is that I
cannot find any evidence that any one else has made the connection
between trace monoids and Unicode strings under canonical equivalence.
I would like to update the article on the trace monoid with its most
important example, Unicode strings under canonical equivalence, but,
alas, that seems to be 'original research'!
I'm beginning to think that 'letting the regex choose the input
character' might be a better method of dealing with interleaving of
subexpressions even for 'non-deterministic' engines, i.e. those which
follow all possible paths in parallel. I'll have to compare the
relevant complexities.
> Good books on the subject are now becoming difficult to find (or
> they are more expensive now), and too difficult to use on the web
> (for such very technical topics, it really helps to have a printed
> copy that you can annotate, explore, or have beside you instead of on
> a screen (and printing ebooks is not an option if they are
> voluminous)). Maybe you have other books to recommend.
Google Books, in English, gives access to a very helpful chapter on
regular languages in trace monoids in 'The Book of Traces'.
I found Russ Cox's Internet notes on regular expressions helpful, though
not everyone agrees with his love of non-determinism.
Richard.