Regular Expressions and Canonical Equivalence

Wed May 13 19:31:29 CDT 2015

What is the current state of play on regular expression engines that
acknowledge canonical equivalence?  By 'acknowledge', I mean that the
engine will deem a string to have a match for a pattern if any string
canonically equivalent to that string does.  I believe this corresponds
to the intent of requirement RL2.1 that was in UTS#18 'Unicode Regular
Expressions' until the towel was thrown in: the paragraph survived but
the requirement vanished.
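
To make that definition concrete, here is a minimal brute-force sketch
in Python.  It is an illustration only, not how any existing engine
works.  The names canonical_reorderings and search_canon_eq are
hypothetical, the pattern is assumed to be written with decomposed
characters, only fully decomposed equivalents of the text are tried,
and the enumeration is exponential in the length of a run of marks.

```python
import re
import unicodedata


def canonical_reorderings(text):
    """Yield every fully decomposed string canonically equivalent to text.

    Starting from NFD(text), adjacent characters may be swapped when both
    have non-zero and different canonical combining classes.
    """
    start = unicodedata.normalize("NFD", text)
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        yield s
        for i in range(len(s) - 1):
            a = unicodedata.combining(s[i])
            b = unicodedata.combining(s[i + 1])
            if a and b and a != b:
                t = s[:i] + s[i + 1] + s[i] + s[i + 2:]
                if t not in seen:
                    seen.add(t)
                    stack.append(t)


def search_canon_eq(pattern, text):
    """True if the pattern matches inside any enumerated equivalent of text."""
    return any(re.search(pattern, v) for v in canonical_reorderings(text))


# <a, dot below, acute>: a plain search for <a, acute> fails, but an
# equivalent reordering <a, acute, dot below> contains it.
text = "a\u0323\u0301"
print(re.search("a\u0301", text) is not None)
print(search_canon_eq("a\u0301", text))
```

This says nothing about which match or subexpression matches to
report; it only decides whether some equivalent string has a match.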

I have been putting my own together, but my efforts have bogged down
on how to select the match and subexpression matches to report.  The
relevant theory is not that of regular languages of strings, but that
of regular languages of 'traces'.  I currently leave the results
undefined if the algebraic Kleene star of a subexpression is not
itself a regular language, e.g. (\u0323\u0301)*.
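
A small Python sketch of why that example is awkward, assuming
nothing beyond the standard unicodedata module (the helper name
in_star_closure is hypothetical).  U+0323 and U+0301 have different
non-zero combining classes, so they commute under canonical
equivalence, and the strings equivalent to some repetition of
<U+0323, U+0301> are exactly those with equally many of each mark.
That set is not a regular language of strings.

```python
import unicodedata

DOT_BELOW = "\u0323"  # canonical combining class 220
ACUTE = "\u0301"      # canonical combining class 230


def nfd(s):
    return unicodedata.normalize("NFD", s)


def in_star_closure(s):
    """True if s is canonically equivalent to (DOT_BELOW ACUTE) repeated n times."""
    n = s.count(DOT_BELOW)
    return nfd(s) == nfd((DOT_BELOW + ACUTE) * n)


# The two marks commute: both orders normalize to the same string.
print(nfd(ACUTE + DOT_BELOW) == nfd(DOT_BELOW + ACUTE))

# Any interleaving with equal counts is in the closure ...
print(in_star_closure(DOT_BELOW * 2 + ACUTE * 2))
print(in_star_closure(ACUTE + DOT_BELOW + DOT_BELOW + ACUTE))
# ... but unequal counts are not.
print(in_star_closure(DOT_BELOW + ACUTE + ACUTE))
```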

Canonical equivalence is particularly relevant to using regular
expressions for text rendering, e.g. for something like an imitation
of Microsoft’s Universal Shaping Engine.

I note that ICU is having another attempt at supporting canonical
equivalence - http://bugs.icu-project.org/trac/ticket/9111 'Support
UREGEX_CANON_EQ'.  At least, they are if the User Guide
(http://userguide.icu-project.org/strings/regexp) is to be believed.
Perhaps not, though, if the old comments in the ticket are taken
seriously.

For example, I believe that one should be able to find the Lanna script
subscript nga <U+1A60 TAI THAM SIGN SAKOT, U+1A26 TAI THAM LETTER NGA>
in the word ᨠᩮᩥ᩠᩵ᨦ <koeng> /kɤŋ/ 'half' <U+1A20 TAI THAM LETTER HIGH KA,
U+1A6E TAI THAM VOWEL SIGN E, U+1A65 TAI THAM VOWEL SIGN I, U+1A75 TAI
THAM SIGN TONE-1, U+1A60 TAI THAM SIGN SAKOT, U+1A26 TAI THAM LETTER
NGA> or the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN
SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>.  As far as I can
tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the
combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323
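COMBINING DOT BELOW> of Vietnamese letter and tone mark.  One will not
find them if one simply applies the string theory of regular
expressions to NFD equivalents, as the initial bug report in
the ticket suggests doing: canonical ordering puts U+1A75 (ccc 230)
between U+1A60 (ccc 9) and U+1A26, and puts U+0323 (ccc 220) between
the base o and U+0302 (ccc 230), so neither search string is a
substring.  A later comment in the ticket suggests that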
the alphabet for the string theory should be 'the combining
sequences'.  (I hope there is no theoretical problem from there being
an infinite number of them.)  The Vietnamese search would work if the
alphabet in the string theory were *Vietnamese* collation elements.
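
The failure of plain substring search on NFD forms can be checked
with Python's standard unicodedata module.  This is only a
demonstration of the two examples above; the helper names nfd and
show are hypothetical, and the combining classes come from whatever
Unicode data the Python build ships with.

```python
import unicodedata


def nfd(s):
    return unicodedata.normalize("NFD", s)


def show(s):
    """List code points with their canonical combining classes."""
    return " ".join(
        "U+%04X(ccc=%d)" % (ord(c), unicodedata.combining(c)) for c in s
    )


buoc = "bu\u1ed9c"                                   # Vietnamese 'to bind'
o_circumflex = "\u00f4"                              # the letter sought
koeng = "\u1a20\u1a6e\u1a65\u1a75\u1a60\u1a26"       # Tai Tham 'half'
subscript_nga = "\u1a60\u1a26"                       # SAKOT + NGA

for needle, haystack in [(o_circumflex, buoc), (subscript_nga, koeng)]:
    print(show(nfd(needle)), "in", show(nfd(haystack)))
    # Plain substring search on the NFD forms:
    print(nfd(needle) in nfd(haystack))
```

In both cases a mark with a lower combining class is sorted in
between the characters being searched for, so the search string is
not contiguous in the NFD text.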

In the text rendering domain, HarfBuzz makes regular expressions work
with conversion to NFD by permuting the canonical combining classes
on a script by script basis.  This requires care.

Richard.