Internationalised Computer Science Exercises
Philippe Verdy via Unicode
unicode at unicode.org
Mon Jan 29 01:54:29 CST 2018
You may also wonder why I describe a regexp that would never match anything
but would be handled itself as a successful match: it is a useful extension
that allows stopping early the analysis and genenalizes the concept of
negation (defined in character classes with the minus operator).
For example "(b{}|[a-z])*" would match any word made of any letters [a-z]
except that when it encounters any "b" it attempts to match "{}" (which
doesn't match anything as it means "match 1 or more within a list having no
item to choose from), it succeeds early and invalidates all matches in the
alternatives given, in summary it will be mostly the same as
"[[a-z]-[b]]*", or "[ac-z]*", but if there's a "b", it returns an empty
match for it, located just after the "b" found.
Note: the optional quantifiers are the classic ones used in regexps:
- "{m,n}" for m to n occurences,
- "{n}" for exactly n occurences (for n>0), equivalent to "{n,n}" (the
default quantifier is "{1}" or "{1,1}"), except "{0}" which is equivalent
to "{1,0}",
- "{m,}" for at least m occurences,
- "{,n}" or "{0,n}" for at most n occurences,
- "?" for 0 or 1 occurence, equivalent to "{,1}"
- "+" for 1 or more occurences, equivalent to "{1,}"
- "*" for 0 or more occurences, equivalent to "{0,}"
- I don't know if we need here special greedy/non-greedy quantifiers ("?*",
"?+", "??", "*?, "*+", "+?", "+*", and so on...)
However the quantifiers at start of this (unordered) "exclusive choice
lists" extension, with form "{quantifier item1 | item2 | ...}" or "{quantifier
[class]}", do not just count the items, they also disallow multiple
occurences of the same chosen item, unlike the quantifiers used as suffixes
(after a character, character class or subregexp between parentheses).
The number of items in the exclusive choice list is never zero but it may
be a matchable empty string (it cannot be an "empty class", as a character
class must match exactly 1 character to choose from a set of 1-character
values; but the "empty class" is representable as "{{0}}" using a
quantified exclusive choice list). If an item in the exclusive choice list
is an empty string and the quantifier of the choice list is not "{0}", the
exclusive choice list will always match successfully.
I insist on the term "exclusive", because this is the interesting property
that allows bounding the occurences, and on the term "unordered" (which
avoids having to combinatorially specify all the possibile order or
occurence of items in the choice list (if the choice list has (n) items,
there are (n!) possible orders with classic regexps and an item list with
only 13 items would have more than 2^32 orders to specify with classic
regexps; in Unicode, we have up to 255 possible non-zero canonical classes,
so we cannot represent them all in classic regexps within any computer:
(253!) ~5.17e499).
2018-01-29 7:37 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:
> I made an error for the character class notation:
> "{?optionalquantifier[class]}" should be just
> "{optionalquantifier[class]}"...
>
> So "{?[abc]}" contains 1 item "[abc]" to choose from in any order, it is
> not quantified explicitly so it matches by default 1 or more, but as
> there's only one item, it will match just one "[abc]"
> But "{[abc]}" contains 3 items from the class "[abc]" to choose from in
> any order, so it will match "a", "b", "c", "ab", "ba", "ac", "ca", "abc",
> "acb", "bac", "bca", "cab" or "cba".
> And "{{1}[abc]}" is quantified to match one and only one item, and is
> equivalent to "[abc]" and matches only "a", "b", or "c"
> And "{{0}[abc]}" is quantified to match zero and only zero item (the
> items are not relevant) and will never match anything, just like
> "{{0}a|b|c}" or "{{0}}".
> And "{{2}[abc]}" or "{{2,2}[abc]}" is quantified to match two and only
> two items from the character class, and matches only "ab", "ba", "ac", or
> "ca", it is equivalent to "{{2,2}a|b|c}" or "{{2}a|b|c}".
>
> With that extension you can build the necessary regexps to match canonical
> equivalent strings with a finite regexp.
>
> 2018-01-29 7:16 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:
>
>>
>>
>> 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <
>> unicode at unicode.org>:
>>
>>> On Sun, 28 Jan 2018 20:29:28 +0100
>>> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>>>
>>> > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
>>> > unicode at unicode.org>:
>>> >
>>> > > On Sat, 27 Jan 2018 14:13:40 -0800The theory
>>> > > of regular expressions (though you may not think that mathematical
>>> > > regular expressions matter) extends to trace monoids, with the
>>> > > disturbing exception that the Kleene star of a regular language is
>>> > > not necessarily regular. (The prototypical example is sequences
>>> > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are
>>> > > canonically equivalent in Unicode terms. A Unicode example is the
>>> > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II -
>>> > > there is no FSM that will recognise canonically equivalent strings).
>>> > >
>>> >
>>> > I don't see why you can't write this as the regular expression:
>>> > (x | y)*
>>>
>>> Because xx does not match.
>>>
>>> In principle, it can be done iteratively thus:
>>>
>>> 1) Look for sequences of x's and y's - your (x | y) *
>>> 2) Discard matches from (1) where the number of x's and y's are equal.
>>>
>>> However, the second step cannot be implemented by a *finite* state
>>> machine.
>>>
>>> > For the Unicode canonical equivalences, this applies to distinct
>>> > characters that have distinct non-zero combining classes.
>>>
>>> Those of us who've looked at optimising collation by reducing
>>> normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in
>>> theory, a source of many problems.
>>>
>>> > But of course searching for
>>> >
>>> > <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
>>> > <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
>>> >
>>> > requires transforming it to NFD first as:
>>>
>>> That wasn't I had in mind. What I had in mind was accepting the
>>> propositions that the string <LATIN SMALL LETTER A, COMBINING DOT
>>> BELOW, COMBINING CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH
>>> CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW.
>>>
>>> > <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
>>> >
>>> > so thet the regexp transforms to:
>>> >
>>> > <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] -
>>> > [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> *
>>> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT
>>> > BELOW> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
>>> > BELOW> [[:cc=above:][:cc=below:]]
>>> > ]] * < COMBINING CIRCUMFLEX>
>>>
>>> If everything is converted to NFD, the regular expressions using traces
>>> can be converted to frequently unintelligible regexes on strings; the
>>> behaviour of the converted regex when faced with an unnormalised string
>>> is of course irrelevant.
>>>
>>> In the search you have in mind, the converted regex for use with NFD
>>> strings is actually intelligible and simple:
>>>
>>> <LATIN SMALL LETTER A>
>>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>>> <COMBINING DOT BELOW>
>>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>>> <COMBINING CIRCUMFLEX>
>>>
>>> Informal notation can simplify the regex still further.
>>>
>>> There is no upper bound to the length of a string matching that regex,
>>>
>>
>> Wrong, you've not read what followed immediately that commented it
>> already: it IS bound exactly because you cannot duplicate the same
>> combining class, and there's a known finite number of them for acceptable
>> cases: if there's any repetition, it will always be within that bound. But
>> it not necessay to expand all the combinations of combining classes to all
>> their possible ordering of occurence (something that a classic regexp
>> normally requires by expecting a specific order).
>>
>> One way to solve it would have to have (generic) regexp extension
>> allowing to specify a combination of one or more of several items in a
>> choice list in any order, but never more than one occurence of each of
>> item. This is possible using a rule with boolean flags of presence, one
>> boolean for each item in the choice list.
>>
>> Something like {?a|b|c|d} matching zero or more (or all of them) of
>> a,b,c,d (these can be subregexps) in any order, and {?+a|b|c|d}
>> matching one or more, and {?{m,n}a|b|c|d} matching betwen m and n of
>> them (in any order in all cases)
>> So that {?a|b|c|d}{1,1} is the same as (a|b|c|d) but without the
>> capture, i.e. (?:a|b|c|d), and {?{m,n}a} is the same as a{m,n}, and {?+a}
>> is the same as a, and {?*a} is the same as a?
>>
>> Which can also be written respectrively as {?*[abcd]}, {?+[abcd]} and {?{m,n}[abcd])
>> if the items of the choice list are characters that can be compacted with
>> the classic "character class" notation [abcd].
>>
>> In all these the "{?quantifier list}" notation is always bound by the
>> number of items in the list (independantly of the quantifier, and if
>> individual items in the list are bound in length, the whole will be bound
>> by the sum of their lengths. So even if the quantifier is higher than than
>> the number of items in the list, it will be capped: "{?{1000}a}" will only
>> match "a", and "{?{1000}}" will never match anything (because the list
>> is empty: the specified higher bound 1000 is capped to 0 but the specified
>> lower bound 1000 is capped to 1 and this is impossible) and is also
>> equivalent to {?} where the min-max specified bounds are 1 by default, but
>> capped to 1,0
>>
>>
>>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180129/4832a26c/attachment.html>
More information about the Unicode
mailing list