Internationalised Computer Science Exercises

Wed Jan 31 19:38:58 CST 2018

On Wed, 31 Jan 2018 19:45:56 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  

> > On Mon, 29 Jan 2018 14:15:04 +0100  
> > <U WITH DIAERESIS AND MACRON> was meant to be an example of a
> > searched string.  For example, <U WITH DIARESIS AND MACRON,
> > COMBINING DOT BELOW> contains, under canonical equivalence, the
> > substring <U WITH DIAERESIS, COMBINING DOT BELOW>.  Your regular
> > expressions would not detect this relationship.

> My regular expression WILL detect this: scanning the text implies
> first composing it to "full equivalent decomposition form" (without
> even reordering it, and possibly recompose it to NFD) while reading
> it and bufering it in forward direction (it just requires the
> decomposition pairs from the UCD, including those that are "excluded"
> from NFC/NFD).

No.  To find <LATIN SMALL LETTER A WITH CIRCUMFLEX AND COMBINING DOT
BELOW>, you constructed, on "Sun, 28 Jan 2018 20:30:44 +0100":

  <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]
  ]] * ( <COMBINING CIRCUMFLEX> [[ [^[[:cc=0:]]] -
  [[:cc=above:][:cc=below:]] ]]
* <COMBINING DOT BELOW>
  |  <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] -
  [[:cc=above:][:cc=below:]] ]]
* < COMBINING  CIRCUMFLEX>

To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW> you
would construct

  <LATIN SMALL LETTER U>
  [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]]] * 
 ( <COMBINING DIAERESIS>
   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]* 
   <COMBINING DOT BELOW>
 | <COMBINING DOT BELOW>
   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
   <COMBINING DIAERESIS>
 )

(A final ')' got lost between brain and text; I have restored it.)

However, <U WITH DIARESIS AND MACRON, COMBINING DOT BELOW> decomposes to
<LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING MACRON, COMBINING
DOT BELOW>.  It doesn't match your regular expression, for between COMBINING
DIAERESIS and COMBINING DOT BELOW there is COMBINING MACRON, for which
ccc = above!

> The regexp exgine will then only process the "fully decomposed" input
> text to find matches, using the regexp transformed as above (which
> has some initial "complex" setup to "fully decompose" the initial
> regexp), but only once when constructing it, but not while processing
> the input text which can be then done straightforward with its full
> decomposition easily performed on the fly without any additional
> buffering except the very small lookahead whose length is never
> longer than the longest "canonical" decompositions in the UCD, i.e.
> at most 2 code points of look ahead).

Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>.

Conversion to NFD on input only requires a small buffer for natural
orthographies.  I suspect the worst in natural language will come from
either narrow IPA transcriptions or Classical Greek. 

> The automata is of course using the classic NFA used by regexp engine
> (and not the DFA which explodes combinatorially in all regexps), but
> which is still fully deterministic (the current "state" in the
> automata is not a single integer for the node number in the traversal
> graph, but a set of node numbers, and all regexps have a finite
> number of nodes in the traversal graph, this number being
> proportional to the length of the regexp, it does not need lot of
> memory, and the size of the current "state" is also fully bounded,
> never larger than the length of the regexp). Optimizing some
> contextual parts of the NFA to DFA is possible (to speedup the
> matching process and reduce the maximum storage  size of the "current
> state") but only if it does not cause a growth of the total number of
> nodes in the traversal graph, or as long as this growth of the total
> number does not exceed some threshold e.g. not more than 2 or 3 times
> the regexp size).

In your claim, what is the length of the regexp for searching for ậ in a
trace? Is it 3, or is it abut 14?  If the former, I am very interested
in how you do it.  If the latter, I would say you already have a form
of blow up in the way you cater for canonical equivalence.

Even with the dirty trick of normalising the searched trace for input
(I wanted the NFA propagation to be defined by the trace - I also didn't
want to have to worry about the well-formedness of DFAs or NFAs), I
found that the number of states for a concatenation of regular
languages of traces was bounded above by the product of the number of
states.  This doesn't strike me as inherently unreasonable, for I get
the same form of bound for the intersection of regular languages even
for strings. In both cases, a lot of the nodes for the concatenation or
intersection are unreachable.

Kleene star is a problem on size.  I think there is a polynomial bound
for when A* is a regular language.  If I substitute 'concurrent star'
for Kleene star, which has the nice property that the concurrent star
of a regular trace language is itself regular, then the bound I have on
the number of states of the concurrent star is proportional to the
third power of the number of states for the original trace language.
The states are fairly simply derived from the states for recognising
the regular language A.

(My size bounds are for the trace of fully decomposed Unicode character
strings under canonical equivalence.  I am not sure that they hold for
arbitrary traces.)

I believe the concurrent star of a language A is (|A|)*, where

|A| = {x ∊ A : {x}* is a regular language}

(The definition works for the trace of fully decomposed Unicode
character strings under canonical equivalence.)

Concurrent star is not a perfect generalisation.  If ab = ba, then
X = {aa, ab, b} has the annoying property that X* is a regular trace
language, but |X|* is a proper subset of X*.  For Unicode, X would be a
rather unusual regular language.

Richard.