Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at ntlworld.com
Thu May 14 13:13:24 CDT 2015


On Thu, 14 May 2015 12:58:29 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2015-05-14 9:59 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > An elegant formal solution to the Kleene star problem interprets
> > (\u0323\u0302)* as (\u0323|\u0302)*.  However, that is
> > counter-intuitive

The technical term for this is the 'concurrent iteration' - or at
least, that's the term used in the 'Book of Traces'.

> For your example "(\u0323\u0302)*" the characters in the alternatives
> (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to
> NFD (which is the same here) are just using at most two distinct
> non-zero combining classes and no blocker; so it is safe to trasform
> it to (\u0323|\u0302)* for a first pass matching that will then only
> check candidate matchs in the second pass. or more efficiently, a
> second finite state automata (FSA) running in parallel with its own
> state:

You've forgotten the basic problem.  A *finite* state automaton cannot
count very far; with only n states, it cannot count as far as n.

For this simple example, one could simply use something like
(\u0323\u0302)\{0,7\}, which should be more than enough for any likely
occurrences.  It's an interesting challenge, but I think solving it
provides satisfaction rather than practical benefit.

> However, the most difficult part for regexps supporting canonical
> equivalence is about what we can do to return submatches! they are not
> necessarily contiguous in the input stream. You can still return a
> matching substring but if you use it for performing search/replace
> operations, it becomes difficult to know where to place the
> replacement, when that replacement string (even if it was converted
> first to NFD) may also contain combining characters. Or even worse if
> the replacement contains some blockers that will be inserted in the
> middle of the non-replaced text (wnad where can we safely place the
> remaining characters in the middle of the match but that are not part
> of the match itself ???

Interestingly, ICU hides that detail from the user.

For search and replace on a text buffer, the text to be replaced would
be defined by a list of text intervals.  If the text is unnormalised,
some of the boundaries may divide precomposed characters.

If the interval list is compacted, at most one of the intervals will
contain a character properly having combining class 0.  (U+0F73 and
U+0F75 do not count.)  If there is such an interval, it will be
replaced and the others simply deleted.  If there is no such interval,
then the choice of insertion point may be more difficult.  Indeed, in
some cases, it could be appropriate to reject the replacement command
as undefined in the context.  On the other hand, if the text buffer is
normalised, then one would be able to have well-defined behaviour, as
one does when splitting text into UCA collating elements.

Richard.


More information about the Unicode mailing list