Internationalised Computer Science Exercises
Richard Wordingham via Unicode
unicode at unicode.org
Thu Feb 1 13:20:04 CST 2018
On Thu, 1 Feb 2018 08:03:31 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:
> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:
>> On Wed, 31 Jan 2018 19:45:56 +0100
>> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>>> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <
>>> unicode at unicode.org>:
>>>> For example, <U WITH DIAERESIS AND MACRON,
>>>> COMBINING DOT BELOW> contains, under canonical equivalence, the
>>>> substring <U WITH DIAERESIS, COMBINING DOT BELOW>. Your regular
>>>> expressions would not detect this relationship.
>>> My regular expression WILL detect this: scanning the text implies
>>> first composing it to "full equivalent decomposition
>>> form" (without even reordering it, and possibly recomposing it to
>>> NFD) while reading it and buffering it in the forward direction (it
>>> just requires the decomposition pairs from the UCD, including
>>> those that are "excluded" from NFC/NFD).
>> To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW>
>> you would construct
(i.e. Philippe Verdy would construct)
>> <LATIN SMALL LETTER U>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]]] *
>> ( <COMBINING DIAERESIS>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>> <COMBINING DOT BELOW>
>> | <COMBINING DOT BELOW>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>> <COMBINING DIAERESIS>
>> )
>> However, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW>
>> decomposes to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING
>> MACRON, COMBINING DOT BELOW>. It doesn't match your regular
>> expression, for between COMBINING
>> DIAERESIS and COMBINING DOT BELOW there is COMBINING MACRON, for
>> which ccc = above!
> And my regexp contained all the necessary asterisks, so yes it does not
> match because the combining macron blocks the combining dot below and
> combining diaeresis from commuting, and so there's no canonical
> equivalence.
I'm not sure what you mean by 'commuting' in this case, but either
your statement or your deduction is wrong! Although adjacent characters
with the same non-zero canonical combining class cannot be
interchanged, that does not stop the members of the pair commuting with
their neighbour with a different non-zero ccc whilst preserving
canonical equivalence. Thus the searched string is canonically
equivalent to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING DOT
BELOW, COMBINING MACRON>.
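This is easy to check mechanically. A minimal Python sketch (my own
illustration, using only the standard unicodedata module) shows that the
two sequences normalize to the same NFD string:

```python
import unicodedata as ud

# U+01D6 = LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
searched = '\u01D6\u0323'          # <U WITH DIAERESIS AND MACRON, DOT BELOW>
reordered = 'u\u0308\u0323\u0304'  # <u, DIAERESIS, DOT BELOW, MACRON>

# Both normalize to <u, DOT BELOW, DIAERESIS, MACRON>: the dot below
# (ccc 220) commutes past the above marks (ccc 230), while the two
# above marks keep their relative order.
print(ud.normalize('NFD', searched) == ud.normalize('NFD', reordered))   # True
print([ud.combining(c) for c in ud.normalize('NFD', searched)])          # [0, 220, 230, 230]
```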
In your scheme, assuming you are looking for the most compact match,
one should generate the regular expression:
<LATIN SMALL LETTER U>
[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]]] *
( <COMBINING DIAERESIS>
[[ [^[[:cc=0:]]] - [[:cc=below:]] ]]*
<COMBINING DOT BELOW>
| <COMBINING DOT BELOW>
[[ [^[[:cc=0:]]] - [[:cc=above:]] ]]*
<COMBINING DIAERESIS>
)
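For concreteness, here is a hand-written Python sketch of what that
widened expression computes on NFD input (my own illustration, not
Philippe's engine; the class values 230/220 for Above/Below are from the
UCD):

```python
import unicodedata as ud

ABOVE, BELOW = 230, 220                  # UCD canonical combining classes
DIAERESIS, DOT_BELOW = '\u0308', '\u0323'

def contains_u_diaeresis_dot_below(text):
    """True if `text` contains <u, COMBINING DIAERESIS, COMBINING DOT
    BELOW> up to canonical equivalence, mirroring the regex above:
    other marks may intervene unless a mark of the same combining
    class blocks the one still being sought."""
    s = ud.normalize('NFD', text)
    for i, ch in enumerate(s):
        if ch != 'u':
            continue
        seen_dia = seen_dot = False
        for c in s[i + 1:]:
            ccc = ud.combining(c)
            if ccc == 0:
                break                    # a starter ends the combining sequence
            if c == DIAERESIS:
                seen_dia = True
            elif c == DOT_BELOW:
                seen_dot = True
            elif ccc == ABOVE and not seen_dia:
                break                    # another above mark blocks the diaeresis
            elif ccc == BELOW and not seen_dot:
                break                    # another below mark blocks the dot below
            if seen_dia and seen_dot:
                return True
    return False

# The disputed string <U WITH DIAERESIS AND MACRON, DOT BELOW> matches:
print(contains_u_diaeresis_dot_below('\u01D6\u0323'))         # True
# ...but a macron *before* the diaeresis really does block:
print(contains_u_diaeresis_dot_below('u\u0304\u0308\u0323'))  # False
```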
Have you got a program doing this and reporting to you, or did you
assemble the construction by hand? Constructing regular expressions is
known to be tricky.
You cannot replace this by a more restrictive albeit wordier regex as
you suggested on Sunday 28 January
(http://www.unicode.org/mail-arch/unicode-ml/y2018-m01/0145.html).
There is no upper bound on the length of matching expressions.
>>> The automaton is of course using the classic NFA used by regexp
>>> engines (and not the DFA, which explodes combinatorially for some
>>> regexps), but it is still fully deterministic (the current
>>> "state" in the automaton is not a single integer for the node
>>> number in the traversal graph, but a set of node numbers, and all
>>> regexps have a finite number of nodes in the traversal graph,
>>> this number being proportional to the length of the regexp; it
>>> does not need a lot of memory, and the size of the current "state"
>>> is also fully bounded, never larger than the length of the
>>> regexp). Optimizing some contextual parts of the NFA to a DFA is
>>> possible (to speed up the matching process and reduce the maximum
>>> storage size of the "current state"), but only if it does not
>>> cause a growth of the total number of nodes in the traversal
>>> graph, or as long as this growth does not
>>> exceed some threshold (e.g. not more than 2 or 3 times the regexp
>>> size).
>> In your claim, what is the length of the regexp for searching for ậ
>> in a trace? Is it 3, or is it about 14? If the former, I am very
>> interested in how you do it. If the latter, I would say you
>> already have a form of blow up in the way you cater for canonical
>> equivalence.
> For searching â, the transformed regexp is just
>
> <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] *
> <COMBINING CIRCUMFLEX>
> The NFA traversal graph contains nodes at location pointed below by
> apostrophes:
>
> ' <LATIN SMALL LETTER A>
> ' [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
> ' *
> ' <COMBINING CIRCUMFLEX>
> '
>
> It has 5 nodes only (assuming that the regexp engine will compute
> lookup tables to build the character classes).
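As an illustration of that state-set traversal (my own sketch in Python,
not Philippe's implementation), the small graph for â can be simulated
directly on NFD input:

```python
import unicodedata as ud

ABOVE = 230
CIRCUMFLEX = '\u0302'

def contains_a_circumflex(text):
    """State-set simulation of
        <a> [[ [^[:cc=0:]] - [:cc=above:] ]]* <COMBINING CIRCUMFLEX>
    as a substring search on NFD input.
    States: 0 = looking for 'a', 1 = inside the skip loop."""
    states = {0}
    for c in ud.normalize('NFD', text):
        ccc = ud.combining(c)
        nxt = {0}                        # a match may start at any position
        if 0 in states and c == 'a':
            nxt.add(1)
        if 1 in states:
            if ccc != 0 and ccc != ABOVE:
                nxt.add(1)               # skippable mark: stay in the loop
            if c == CIRCUMFLEX:
                return True              # reached the accepting node
        states = nxt
    return False

print(contains_a_circumflex('a\u0323\u0302'))  # True: dot below commutes
print(contains_a_circumflex('a\u0304\u0302'))  # False: the macron blocks
```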
I asked about ‘ậ’, not ‘â’. Anyway, you’ve effectively answered the
question. You’re talking about regular expressions for strings, not
regular expressions for traces.
> When there's a
> quantifier (like "*" here) which is not "{1,1}" after a character
> or character class, it inserts a node after it. No node is inserted in
> the NFA traversal graph for non-capturing parentheses, but nodes may
> be inserted for capturing parentheses, and there are the two nodes
> representing the start of the regexp and the end of the regexp (a node
> is also inserted before the "^" or "$" context delimiters to match the
> start and end of input lines (they are character classes, excluded
> from the capture), or the start and end of the input text (for regexps
> using the "multiline" flag)).
Quantifier nodes like {2,4} probably break down for non-deterministic
expressions like (a(bc)?|b){2,4}: the string "ab" contains two
iterations, but the longer string "abc" contains only one.
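A quick check with Python's re module (a standard backtracking engine,
with the groups made non-capturing for clarity) bears this out:

```python
import re

# {2,4} counts iterations of the alternation: the two-character string
# matches as "a" + "b" (two iterations), while "abc" can only be
# consumed as a single iteration of a(bc)?, which is one too few.
pat = re.compile(r'(?:a(?:bc)?|b){2,4}')

print(pat.fullmatch('ab') is not None)   # True
print(pat.fullmatch('abc') is not None)  # False
```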
> > Even with the dirty trick of normalising the searched trace for
> > input (I wanted the NFA propagation to be defined by the trace - I
> > also didn't want to have to worry about the well-formedness of DFAs
> > or NFAs), I found that the number of states for a concatenation of
> > regular languages of traces was bounded above by the product of the
> > number of states
> For this worst case, you don't generate a product of the two NFA
> traversal graphs, you can directly concatenate each graph, the growth
> in size remaining proportional to the total length of the initial
> regexp, with a small bounded factor (this factor depends on how you
> represent each node: with character classes possibly including SOT
> and EOT, or without character classes, with separate nodes for each
> character or for SOT or EOT). It does not seem unreasonable to build
> these character classes and compute their unions/intersections where
> necessary across branches when factorizing them. You just need to
> take care of nodes added for non-{1,1} quantifiers.
Concatenation doesn’t work with traces when the first expression can end
in non-starters and the second expression can begin with them. This is
a real issue when parsing sequences of characters for grammatical
correctness, which is why I got interested in traces. A regular
trace expression of the form
[:ccc=1:][:ccc=2:]…[:ccc=n:]
seems to require 2^n states in your scheme. As I effectively only
apply the regex to NFD input strings, I use fewer states. However, the
efficiency of my scheme depends on the order of the commuting factors -
reverse order would require the 2^n states.
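The 2^n count can be illustrated abstractly (my own sketch; the n marks
with distinct combining classes are modelled as plain integers, since
any permutation of them is canonically equivalent):

```python
from itertools import permutations

def reachable_states(n):
    """Count the distinct 'which marks have I seen so far' states a
    matcher must distinguish when n commuting marks may arrive in any
    order: every subset of {1..n} occurs as the prefix of some
    permutation, so all 2**n subsets are reachable."""
    states = set()
    for perm in permutations(range(1, n + 1)):
        seen = frozenset()
        states.add(seen)
        for mark in perm:
            seen = seen | {mark}
            states.add(seen)
    return len(states)

print(reachable_states(3))  # 8  == 2**3
print(reachable_states(4))  # 16 == 2**4
```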
Richard.