Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at unicode.org
Wed Jan 31 12:45:56 CST 2018


2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

> On Mon, 29 Jan 2018 14:15:04 +0100
> > The case of u with diaeresis and macron is simpler: it has two
> > combining characters of the same combining class and they don't
> > commute, still the regexp to match it is something like:
> >
> > U [[:cc>0:]-[:cc=above:]]* <DIAERESIS> [[:cc>0:]-[:cc=above:]]*
> > <MACRON> [[:cc>0:]-[:cc=above:]]*
>
> <U WITH DIAERESIS AND MACRON> was meant to be an example of a searched
> string.  For example, <U WITH DIARESIS AND MACRON, COMBINING DOT BELOW>
> contains, under canonical equivalence, the substring <U WITH DIAERESIS,
> COMBINING DOT BELOW>.  Your regular expressions would not detect this
> relationship.


My regular expression WILL detect this: scanning the text implies first
decomposing it to its full canonically equivalent decomposition (without
even reordering it, though it could optionally be reordered into NFD) while
reading and buffering it in the forward direction. This only requires the
canonical decomposition pairs from the UCD, including those that are
"excluded" from recomposition in NFC.
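
As a rough illustration (a sketch in Python using the standard unicodedata
module, not the engine described here; the helper name is mine), the full
canonical decomposition of the searched string can be computed on the fly,
one code point at a time:

import unicodedata

def full_decompose(text):
    # Recursively apply the canonical decomposition mappings from the UCD,
    # without any reordering (Hangul algorithmic decomposition is ignored
    # in this sketch).
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith("<"):   # canonical only
            expanded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            out.append(full_decompose(expanded))
        else:
            out.append(ch)
    return "".join(out)

# <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW>
print([f"U+{ord(c):04X}" for c in full_decompose("\u01D6\u0323")])
# -> ['U+0075', 'U+0308', 'U+0304', 'U+0323']
# i.e. u, diaeresis, macron, dot below: the searched <U WITH DIAERESIS,
# COMBINING DOT BELOW> is reachable under canonical equivalence.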

The regexp engine will then only process the "fully decomposed" input text
to find matches, using the regexp transformed as above. The somewhat
"complex" setup needed to "fully decompose" the initial regexp is paid only
once, when constructing it, not while processing the input text; that
processing can then be done straightforwardly, with the full decomposition
performed on the fly without any additional buffering except a very small
lookahead, whose length is never longer than the longest canonical
decomposition in the UCD (i.e. just a few code points of lookahead).
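
For instance (again only a sketch, leaning on Python's re module and NFD
from unicodedata rather than a custom engine), the transformed regexp for
<U WITH DIAERESIS AND MACRON> can be built once, with the costly
character-class construction paid only at compile time:

import re, sys, unicodedata

# Build the class [[:cc>0:]-[:cc=above:]] once: all characters whose
# canonical combining class is non-zero and not 230 (Above).  A real
# engine would read this from UCD property tables instead of scanning
# the whole code space as done here.
skippable = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if 0 < unicodedata.combining(chr(cp)) != 230
)
SKIP = "[" + re.escape(skippable) + "]*"

# Transformed regexp for <U WITH DIAERESIS AND MACRON>, applied to the
# decomposed input: u SKIP <DIAERESIS> SKIP <MACRON> SKIP
pattern = re.compile("u" + SKIP + "\u0308" + SKIP + "\u0304" + SKIP)

text = unicodedata.normalize("NFD", "\u01D6\u0323")   # decomposed haystack
print(bool(pattern.search(text)))                     # True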

The automaton is of course the classic NFA used by regexp engines (not the
DFA, which can explode combinatorially for some regexps), but its simulation
is still fully deterministic: the current "state" of the automaton is not a
single integer naming one node of the traversal graph, but a set of node
numbers. Every regexp has a finite number of nodes in its traversal graph,
proportional to the length of the regexp, so this does not need much memory,
and the size of the current "state" is also fully bounded, never larger than
the length of the regexp. Optimizing some contextual parts of the NFA into a
DFA is possible (to speed up matching and reduce the maximum storage size of
the "current state"), but only if it does not grow the total number of nodes
in the traversal graph, or only as long as that growth stays below some
threshold (e.g. no more than 2 or 3 times the regexp size).
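
A minimal sketch of this set-of-states simulation (with a tiny hand-built
transition table standing in for the compiled traversal graph; the function
and example are mine, not taken from any particular engine):

def nfa_match(transitions, start, accepting, text):
    # transitions: dict mapping (node, symbol) -> set of next nodes.
    # The "current state" is a bounded set of node numbers, never larger
    # than the total number of nodes in the traversal graph.
    current = {start}
    for symbol in text:
        current = set().union(
            *(transitions.get((node, symbol), set()) for node in current)
        )
        if not current:        # no active node left: the match has failed
            return False
    return bool(current & accepting)

# Traversal graph for /ab*a/ : 0 -a-> 1, 1 -b-> 1, 1 -a-> 2 (accepting)
transitions = {(0, "a"): {1}, (1, "b"): {1}, (1, "a"): {2}}
print(nfa_match(transitions, 0, {2}, "abba"))   # True
print(nfa_match(transitions, 0, {2}, "ab"))     # False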

In practice, most regexps never exceed a few hundred characters (including
the meta-characters of the regexp syntax itself), and the maximum number of
active nodes during graph traversal rarely exceeds 2 or 3, so the "current
state" is not several hundred integers but a handful of them. Even if you
optimize the NFA partly into a DFA, you can afford to double or triple the
number of nodes to significantly speed up the engine (in order to reduce the
number of node numbers to store in the "current state"). Common ways of
reducing the number of nodes in the traversal graph are to compute character
classes, or to locally expand "bounded non-empty repetitions" (like in the
regexp /x{m,n}/ when m>=1 and n is small), as sketched below.