Internationalised Computer Science Exercises

Richard Wordingham via Unicode unicode at unicode.org
Thu Feb 1 13:20:04 CST 2018


On Thu, 1 Feb 2018 08:03:31 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2018-02-01 2:38 GMT+01:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  

>> On Wed, 31 Jan 2018 19:45:56 +0100
>> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
  
>>> 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode <  
>>> unicode at unicode.org>:  

>>>> For example, <U WITH DIAERESIS AND MACRON,
>>>> COMBINING DOT BELOW> contains, under canonical equivalence, the
>>>> substring <U WITH DIAERESIS, COMBINING DOT BELOW>.  Your regular
>>>> expressions would not detect this relationship.  

>>> My regular expression WILL detect this: scanning the text implies
>>> first composing it to "full equivalent decomposition
>>> form" (without even reordering it, and possibly recomposing it to
>>> NFD) while reading it and buffering it in the forward direction (it
>>> just requires the decomposition pairs from the UCD, including
>>> those that are "excluded" from NFC/NFD).
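
(A minimal Python sketch of the decomposition step described above,
using the UCD mappings exposed by the standard unicodedata module;
Hangul's algorithmic decompositions are ignored for brevity:)

  import unicodedata

  def full_decomposition(text):
      # Recursively apply the canonical decomposition mappings from the
      # UCD, without canonical reordering.
      out = []
      for ch in text:
          mapping = unicodedata.decomposition(ch)
          if mapping and not mapping.startswith('<'):   # canonical only
              decomposed = ''.join(chr(int(cp, 16)) for cp in mapping.split())
              out.append(full_decomposition(decomposed))
          else:
              out.append(ch)
      return ''.join(out)

  print([f'U+{ord(c):04X}' for c in full_decomposition('\u01D6\u0323')])
  # ['U+0075', 'U+0308', 'U+0304', 'U+0323'] - decomposed, unreordered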

>> To be consistent, to find <U WITH DIAERESIS, COMBINING DOT BELOW>
>> you would construct
(i.e. Philippe Verdy would construct)

>>   <LATIN SMALL LETTER U>
>>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>>   ( <COMBINING DIAERESIS>
>>     [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>>     <COMBINING DOT BELOW>
>>   | <COMBINING DOT BELOW>
>>     [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
>>     <COMBINING DIAERESIS>
>>   )

>> However, <U WITH DIAERESIS AND MACRON, COMBINING DOT BELOW>
>> decomposes to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING
>> MACRON, COMBINING DOT BELOW>.  It doesn't match your regular
>> expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW
>> there is COMBINING MACRON, for which ccc = above!
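
(The ccc values involved are easy to confirm, for instance with
Python's standard unicodedata module:)

  import unicodedata

  for c in '\u0308\u0304\u0323':
      print(f'U+{ord(c):04X} {unicodedata.name(c)}: ccc={unicodedata.combining(c)}')
  # U+0308 COMBINING DIAERESIS: ccc=230 (the Above class)
  # U+0304 COMBINING MACRON: ccc=230 (the Above class)
  # U+0323 COMBINING DOT BELOW: ccc=220 (the Below class)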

> And my regexp contained all the necessary asterisks, so yes, it does
> not match, because the combining macron blocks the combining dot below
> and combining diaeresis from commuting, and so there's no canonical
> equivalence.

I'm not sure what you mean by 'commuting' in this case, but either
your statement or your deduction is wrong! Although adjacent characters
with the same non-zero canonical combining class cannot be
interchanged, that does not stop the members of the pair commuting with
a neighbour with a different non-zero ccc whilst preserving
canonical equivalence.  Thus the searched string is canonically
equivalent to <LATIN SMALL LETTER U, COMBINING DIAERESIS, COMBINING DOT
BELOW, COMBINING MACRON>.
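
(Both halves of that claim can be checked quickly, taking equality of
NFD forms as the test of canonical equivalence; a throwaway Python
check:)

  import unicodedata

  nfd = lambda s: unicodedata.normalize('NFD', s)

  # Marks with *different* non-zero ccc commute (dot below past macron):
  assert nfd('u\u0308\u0323\u0304') == nfd('u\u0308\u0304\u0323')
  # Adjacent marks with the *same* non-zero ccc do not:
  assert nfd('u\u0308\u0304') != nfd('u\u0304\u0308')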

In your scheme, assuming you are looking for the most compact match,
one should generate the regular expression:

<LATIN SMALL LETTER U>
[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]*
( <COMBINING DIAERESIS>
  [[ [^[[:cc=0:]]] - [[:cc=below:]] ]]*
  <COMBINING DOT BELOW>
| <COMBINING DOT BELOW>
  [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]*
  <COMBINING DIAERESIS>
)
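
(Since no stock regex engine understands the [:cc=...:] classes, here
is a hand-rolled Python scanner equivalent to the expression above; the
function name is mine, and it assumes the input has already been fully
decomposed:)

  import unicodedata

  ABOVE, BELOW = 230, 220
  ccc = unicodedata.combining

  def match_u_diaeresis_dot_below(text, i=0):
      # <u> [nonstarter - above - below]*
      # ( <0308> [nonstarter - below]* <0323>
      # | <0323> [nonstarter - above]* <0308> )
      n = len(text)
      if i >= n or text[i] != 'u':
          return None
      i += 1
      while i < n and ccc(text[i]) not in (0, ABOVE, BELOW):
          i += 1
      if i < n and text[i] == '\u0308':       # diaeresis first
          i += 1
          while i < n and ccc(text[i]) not in (0, BELOW):
              i += 1
          return i + 1 if i < n and text[i] == '\u0323' else None
      if i < n and text[i] == '\u0323':       # dot below first
          i += 1
          while i < n and ccc(text[i]) not in (0, ABOVE):
              i += 1
          return i + 1 if i < n and text[i] == '\u0308' else None
      return None

  # Both canonically equivalent orderings of the searched string match:
  print(match_u_diaeresis_dot_below('u\u0308\u0304\u0323'))  # 4 (unreordered)
  print(match_u_diaeresis_dot_below('u\u0323\u0308\u0304'))  # 3 (NFD order)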

Have you got a program doing this and reporting to you, or did you
assemble the construction by hand?  Constructing regular expressions is
known to be tricky.

You cannot replace this by a more restrictive albeit wordier regex as
you suggested on Sunday 28 January
(http://www.unicode.org/mail-arch/unicode-ml/y2018-m01/0145.html).
There is no upper bound on the length of matching expressions.

>>> The automaton is of course the classic NFA used by regexp
>>> engines (and not the DFA, which can explode combinatorially), but
>>> it is still fully deterministic (the current
>>> "state" in the automaton is not a single integer for the node
>>> number in the traversal graph, but a set of node numbers, and all
>>> regexps have a finite number of nodes in the traversal graph,
>>> this number being proportional to the length of the regexp, so it
>>> does not need a lot of memory, and the size of the current "state"
>>> is also fully bounded, never larger than the length of the
>>> regexp). Optimizing some contextual parts of the NFA to a DFA is
>>> possible (to speed up the matching process and reduce the maximum
>>> storage size of the "current state"), but only if it does not
>>> cause a growth of the total number of nodes in the traversal
>>> graph, or as long as this growth does not exceed some threshold
>>> (e.g. not more than 2 or 3 times the regexp size).
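
(What is being described is essentially Thompson's NFA simulation,
where the "current state" is a set of graph nodes; a compact sketch,
with a graph encoding and helper names of my own choosing:)

  import unicodedata

  def simulate_nfa(graph, start, accept, text):
      # graph: node -> list of (predicate, node) edges; a predicate of
      # None marks an epsilon edge.  The current "state" is a set of
      # nodes, never larger than the number of nodes in the graph.
      def closure(nodes):
          stack, seen = list(nodes), set(nodes)
          while stack:
              for pred, m in graph.get(stack.pop(), ()):
                  if pred is None and m not in seen:
                      seen.add(m)
                      stack.append(m)
          return seen

      current = closure({start})
      for ch in text:
          current = closure({m for n in current
                               for pred, m in graph.get(n, ())
                               if pred is not None and pred(ch)})
      return accept in current

  # A graph for <a> [[^cc=0] - [cc=above]]* <circumflex> (the exact
  # node count depends on the encoding chosen):
  skip = lambda c: unicodedata.combining(c) not in (0, 230)
  graph = {0: [(lambda c: c == 'a', 1)],
           1: [(None, 2)],
           2: [(skip, 1), (lambda c: c == '\u0302', 3)]}
  print(simulate_nfa(graph, 0, 3, 'a\u0323\u0302'))   # True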

>> In your claim, what is the length of the regexp for searching for ậ
>> in a trace?  Is it 3, or is it about 14?  If the former, I am very
>> interested in how you do it.  If the latter, I would say you
>> already have a form of blow-up in the way you cater for canonical
>> equivalence.
 
> For searching â, the transformed regexp is just
> 
>  <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]*
> <COMBINING CIRCUMFLEX>

> The NFA traversal graph contains nodes at location pointed below by
> apostrophes:
> 
>   ' <LATIN SMALL LETTER A>
>   '  [[ [^[[:cc=0:]]] - [[:cc=above:]] ]]
>   ' *
>   ' <COMBINING CIRCUMFLEX>
>   '
> 
> It has 5 nodes only (assuming that the regexp engine will compute
> lookup tables to build the character classes).

I asked about ‘ậ’, not ‘â’.  Anyway, you’ve effectively answered the
question. You’re talking about regular expressions for strings, not
regular expressions for traces.

> When there's a
> quantifier (like "*" here) which is not "{1,1}" after each character
> or character class, it inserts a node after it.  No node is inserted
> in the NFA traversal graph for non-capturing parentheses, but nodes
> may be inserted for capturing parentheses, and there are the two
> nodes representing the start of the regexp and the end of the regexp.
> (A node is also inserted before the "^" or "$" context delimiters to
> match the start and end of input lines (they are character classes,
> excluded from the capture), or the start and end of the input text
> (for regexps using the "multiline" flag).)

Quantifier nodes like {2,4} probably break down for non-deterministic
expressions like (a(bc)?|b){2,4}.  The string "ab" contains 2
iterations, but the longer string "abc" contains only one iteration, so
a single per-node counter cannot tell how many iterations have been
completed.  (The example below makes the ambiguity concrete.)
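
(With Python's re module, which backtracks:)

  import re

  p = re.compile(r'(?:a(?:bc)?|b){2,4}')
  print(bool(p.fullmatch('ab')))     # True:  two iterations, 'a' + 'b'
  print(bool(p.fullmatch('abc')))    # False: only one iteration, 'abc'
  print(bool(p.fullmatch('abcb')))   # True:  two iterations, 'abc' + 'b'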

>> Even with the dirty trick of normalising the searched trace for
>> input (I wanted the NFA propagation to be defined by the trace - I
>> also didn't want to have to worry about the well-formedness of DFAs
>> or NFAs), I found that the number of states for a concatenation of
>> regular languages of traces was bounded above by the product of the
>> number of states of the component automata.

> For this worst case, you don't generate a product of the two NFA
> traversal graphs; you can directly concatenate each graph, the growth
> in size remaining proportional to the total length of the initial
> regexp, with a small bounded factor (this factor depends on how you
> represent each node: with character classes possibly including SOT
> and EOT, or without character classes, by separate nodes for each
> character or SOT or EOT).  It does not seem unreasonable to build
> these character classes and compute their unions/intersections where
> necessary across branches when factorizing them.  You just need to
> take care of the nodes added for non-{1,1} quantifiers.

Concatenation doesn’t work with traces when the first expression can end
in non-starters and the second expression can begin with them.  This is
a real issue when parsing sequences of characters for grammatical
correctness, which is why I got interested in traces.  A regular
trace expression of the form

[:ccc=1:][:ccc=2:]…[:ccc=n:]

seems to require 2^n states in your scheme.  As I effectively only
apply the regex to NFD input strings, I use fewer states.  However, the
efficiency of my scheme depends on the order of the commuting factors;
reverse order would require the 2^n states.  (The sketch below shows
where the 2^n comes from.)
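
(A toy illustration, assuming the trace expression has to accept every
interleaving of n marks with distinct ccc values: a deterministic
matcher must remember *which* of the n marks it has already seen, and
every one of the 2^n subsets actually occurs:)

  from itertools import permutations

  def count_states(n):
      # Collect every "set of marks seen so far" that arises while
      # reading some permutation of n distinct marks.
      states = set()
      for perm in permutations(range(n)):
          seen = frozenset()
          states.add(seen)
          for mark in perm:
              seen = seen | {mark}
              states.add(seen)
      return len(states)

  print(count_states(4))   # 16 == 2**4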

Richard.


