Internationalised Computer Science Exercises

Richard Wordingham via Unicode unicode at
Mon Jan 29 14:53:05 CST 2018

On Mon, 29 Jan 2018 14:15:04 +0100
Philippe Verdy via Unicode <unicode at> wrote:

> No since the begining we were talking about matching strings that are
> canonically equivalent within regexps. So that searching for a regexp
> containing precombined characters or decomposed characters would find
> them independantly of the encoded form (normalized or not) of the
> input and independantly that there are addtional combining characters
> inserted between them.

OK, we're taking different approaches.  Given finite automata for
recognising NFD strings matching regular languages A and B of traces, I
know how to construct a non-deterministic finite automaton for
recognising any of AB, A and B, A or B, and, if itself a regular
language, A*, where the sets denoted are traces.  For AB, the states I
keep track of are states of A, states of B, and A × 255 × B, where the
2nd coordinate of the latter is the ccc of the latest element of the
searched string used to propagate a state of B.  If I didn't normalise
the searched string, I'd have to keep a list of ccc's of characters used
in propagating states of B.  That gets complicated with A*, for which in
theory I need the simultaneous progression of 255 FSMs (OK, only
about 52 at the moment).  I actually treat A* as A(A*), where no
capture is implied by the parentheses.  Theory says that if NFD([A])
is a regular language (where [x] is the set of strings in the canonical
equivalence class x), then A is a regular language. However,
constructing a finite automaton to check for matches may not be
straightforward.  I prefer to sacrifice the purity of finiteness by
buffering enough of the searched string to convert it to NFD on the fly.

As an example of the complexity, consider checking whether a string
was composed of pairs of identical combining characters.

> The case of u with diaeresis and macron is simpler: it has two
> combining characters of the same combining class and they don't
> commute, still the regexp to match it is something like:
> U [[:cc>0:]-[:cc=above:]]* <DIAERESIS> [[:cc>0:]-[:cc=above:]]*
> <MACRON> [[:cc>0:]-[:cc=above:]]*  

<U WITH DIAERESIS AND MACRON> was meant to be an example of a searched
contains, under canonical equivalence, the substring <U WITH DIAERESIS,
COMBINING DOT BELOW>.  Your regular expressions would not detect this


More information about the Unicode mailing list