Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Thu May 14 14:29:06 CDT 2015


On Thu, 14 May 2015 07:08:14 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
> 
> > For example, I believe that one should be able to find
> > [...]
> > the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9
> > LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>.  As
> > far as I can tell, U+1ED9 is not a letter of the Vietnamese
> > alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH
> > CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and
> > tone mark.
> 
> What you're looking for in this case is neither an NFC match nor an
> NFD match, but a language-dependent match, as you imply further down.
> <1ED9> decomposes to <006F 0323 0302>, and if you want a match with
> <00F4>, which decomposes to <006F 0302>, your regex engine has to
> reorder the marks. It sounds unlikely that you'll find such an
> engine, but there is a lot of Vietnamese-language–specific software
> out there, so you never know.

There's no more reordering than is involved in doing a Vietnamese
collation-based search, where one has to split <006F 0323 0302> up into
collating elements <006F 0302><0323>. 

Possibly a back-tracking regular expression would reorder the string.

My experimental canonical-equivalence respecting regular
expression engine is designed in the same manner as the Thomson
construction - it is a non-deterministic finite automaton (except for
the effects of capturing parts of the input string) composed of a
hierarchy of non-deterministic finite automata.

States are identified as
strings of scalars following the hierarchy. The engine checks whether a
string matches a regular expression.

The engine decomposes the string to NFD.  This keeps the automaton for
the concatenation of two regular expressions simple.

I will now show how it handles the search.  The regular expression to
match against is \u00f4.* - a character U+00F4 followed by anything,
including nothing.

My program essentially produced the following output, with
comments added later indicated by #:

$ ./regex  '\u00f4.*' ộ # Arguments are regular expression and string

Simple Unicode regex "\u006F\u0302" # First half of regular expression
                                    # as the automaton actually sees it.

Initial states: 

0) L0   # Initial state - expecting 'o', in first half of expression.
        # 'L' = left.

=o=10:20:=         # Gets 'o'

L0 => L1           # Changes state to expecting combining circumflex

=0323=20:30:=      # Gets combining dot below (U+0323)

L1 => N001220:1:*  # N => State for concatenation of regular
                   # expressions; both automata are run.
                   # 001 => Substring length within state identifier.
                   # 220 => Combining class of U+0323.  Characters with
                   # this ccc or lower may no longer be processed by
                   # the left-hand automaton.
                   # : is punctuation for readability of state
                   # 1 => Left half still expecting combining circumflex
                   # * => only state for regex ".*".  

=0302=30:06:=      # Gets combining circumflex (U+0302)
                   # The engine runs a non-deterministic finite
                   # automaton.  It now branches to 3 states.

N001220:1:* => N001220:M:* # Left half has now reached end of expected
                           # string.

N001220:1:* => R* (match)  # On transferring to an accept state of
                           # \u00f4 automaton, only the .*
                           # automaton needs to # be processed.

N001220:1:* => N001230:1:* # Possibly the combining circumflex is to
                           # match the '.*'.  The combining class is
                           # updated to 230.  This 230 will actually
                           # block the \u00f4 automaton from reaching
                           # an accept state from this state, for a
                           # combining circumflex can henceforth only be
                           # considered by the .* automaton.

At no point has the input string been reordered.

Richard.








 

> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO ����
> 
> 




More information about the Unicode mailing list