Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at
Sun Jan 28 13:29:28 CST 2018

2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
unicode at>:

> On Sat, 27 Jan 2018 14:13:40 -0800The theory
> of regular expressions (though you may not think that mathematical
> regular expressions matter) extends to trace monoids, with the
> disturbing exception that the Kleene star of a regular language is not
> necessarily regular.  (The prototypical example is sequences (xy)^n
> where x and y are distinct and commute, i.e. xy and yx are canonically
> equivalent in Unicode terms.  A Unicode example is the set of strings
> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that
> will recognise canonically equivalent strings).

I don't see why you can't write this as the regular expression:
  (x | y)*
For the Unicode canonical equivalences, this applies to distinct characters
that have distinct non-zero combining classes.

But of course searching for


requires transforming it to NFD first as:


so thet the regexp transforms to:

  <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
  ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]
  |  <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]

Note that the "complex" set of characters used three times above is finite,
it contains all combining characters of Unicode that have a non-zero
combining class except above and below, i.e. all Unicode characters whose
combining class is not 0, 220 (below) or 230 (above).

However, It is too simplified, because the allowed combining classes must
occur at most once in each possible non-zero combining class and not
arbitrary numbers of them: these allowed combining classs currently are in
{1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218,
222, 224, 226, 228, 232..234, 240} whose most member elements are used for
very few combining characters (the above and below combining classes are
the most populated ones but we exclude them, all the others have 1 to 9
combining characters assigned to them, or 22 characters with cc=7 (nukta),
or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama).

Once again we can refine them also as a regexp, but this is combinatorial
because we have 52 combining classes (so we would need to consider the 52!
(factorial) alternates). But given the maximum length of what this can
match (0 to 52 combining characters: yes it is finite), this is best done
by not rewriting this as a regexp but by replacing the final "*" by {1,52},
and then just check each returned match of

  [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52}

with a simple scan of these short strings to see that they all have
distinct combining classes (this just requires 52 booleans, easily stored
in a single 64 bit integer initialized to 0 prior to scan the scan of these
small strings). But the theory does not prevent writing it as a regexp
(even if it would be extremely long). So a  Kleene Star closure is possible
and can be efficiently implemented (all depends on the way  you represent
the "current state" in the FSM: a single integer representing a single node
number in the traversal graph is not the best way to do that.

This is a valid regexp, the finite state machine DOES have a finite
lookahead (the full regexp above will match AT MOST 107 characters
including the combining marks, where 107=3+2*52), but this is general to
regexps that generally cannot be transformed directly into a FSM with
finite lookahead, but a FSM is possible: the regexp first transforms into a
simple graph of transitions with a finite number of node (this number is
bound to the length of the regexp itself) where there can be multiple
states active simultaneously; then a basic transform converts this small
graph by transforming nodes into new nodes representing the finite set of
the combinations of active states in the first graph :

There will be many more nodes, and generally this explodes in size because
the transform is combinatorial, and such size explosion has worst
perfomance (explosion of the memory needed to representing the new graph
with a single state active). So regexp engines use the alternative by
representing the current state of traversal of the first simple graph using
a stack of active states and transiting them all separately (this can be
implemented with a "bitset" whose size in bits is the number of states in
the first simple graph, or by using an associative array (dictionnary of
boolean properties whose keys are state numbers in the first graph, which
can be set  or removed: this requires much less memory and it is relatively
fast, even if the full current state is not just a single small integer.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Unicode mailing list