Regular Expressions and Canonical Equivalence
richard.wordingham at ntlworld.com
Thu May 14 14:29:06 CDT 2015
On Thu, 14 May 2015 07:08:14 -0700
"Doug Ewell" <doug at ewellic.org> wrote:
> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
> > For example, I believe that one should be able to find
> > [...]
> > the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9
> > LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As
> > far as I can tell, U+1ED9 is not a letter of the Vietnamese
> > alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and
> > tone mark.
> What you're looking for in this case is neither an NFC match nor an
> NFD match, but a language-dependent match, as you imply further down.
> <1ED9> decomposes to <006F 0323 0302>, and if you want a match with
> <00F4>, which decomposes to <006F 0302>, your regex engine has to
> reorder the marks. It sounds unlikely that you'll find such an
> engine, but there is a lot of Vietnamese-language–specific software
> out there, so you never know.
There's no more reordering than is involved in doing a Vietnamese
collation-based search, where one has to split <006F 0323 0302> up into
collating elements <006F 0302><0323>.
Possibly a back-tracking regular expression would reorder the string.
My experimental canonical-equivalence respecting regular
expression engine is designed in the same manner as the Thomson
construction - it is a non-deterministic finite automaton (except for
the effects of capturing parts of the input string) composed of a
hierarchy of non-deterministic finite automata.
States are identified as
strings of scalars following the hierarchy. The engine checks whether a
string matches a regular expression.
The engine decomposes the string to NFD. This keeps the automaton for
the concatenation of two regular expressions simple.
I will now show how it handles the search. The regular expression to
match against is \u00f4.* - a character U+00F4 followed by anything,
My program essentially produced the following output, with
comments added later indicated by #:
$ ./regex '\u00f4.*' ộ # Arguments are regular expression and string
Simple Unicode regex "\u006F\u0302" # First half of regular expression
# as the automaton actually sees it.
0) L0 # Initial state - expecting 'o', in first half of expression.
# 'L' = left.
=o=10:20:= # Gets 'o'
L0 => L1 # Changes state to expecting combining circumflex
=0323=20:30:= # Gets combining dot below (U+0323)
L1 => N001220:1:* # N => State for concatenation of regular
# expressions; both automata are run.
# 001 => Substring length within state identifier.
# 220 => Combining class of U+0323. Characters with
# this ccc or lower may no longer be processed by
# the left-hand automaton.
# : is punctuation for readability of state
# 1 => Left half still expecting combining circumflex
# * => only state for regex ".*".
=0302=30:06:= # Gets combining circumflex (U+0302)
# The engine runs a non-deterministic finite
# automaton. It now branches to 3 states.
N001220:1:* => N001220:M:* # Left half has now reached end of expected
N001220:1:* => R* (match) # On transferring to an accept state of
# \u00f4 automaton, only the .*
# automaton needs to # be processed.
N001220:1:* => N001230:1:* # Possibly the combining circumflex is to
# match the '.*'. The combining class is
# updated to 230. This 230 will actually
# block the \u00f4 automaton from reaching
# an accept state from this state, for a
# combining circumflex can henceforth only be
# considered by the .* automaton.
At no point has the input string been reordered.
> Doug Ewell | http://ewellic.org | Thornton, CO
More information about the Unicode