Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at unicode.org
Mon Jan 29 00:37:46 CST 2018


I made an error for the character class notation:
"{?optionalquantifier[class]}" should be just
"{optionalquantifier[class]}"...

So "{?[abc]}" contains 1 item "[abc]" to choose from in any order, it is
not quantified explicitly so it matches by default 1 or more, but as
there's only one item, it will match just one "[abc]"
But "{[abc]}" contains 3 items from the class "[abc]" to choose from in any
order, so it will match "a", "b", "c", "ab", "ba", "ac", "ca", "abc",
"acb", "bac", "bca", "cab" or "cba".
And "{{1}[abc]}" is quantified to match one and only one item, and is
equivalent to "[abc]" and matches only "a", "b", or "c"
And "{{0}[abc]}" is quantified to match zero and only zero item (the items
are not relevant) and will never match anything, just like "{{0}a|b|c}" or
"{{0}}".
And  "{{2}[abc]}" or "{{2,2}[abc]}"  is quantified to match two and only
two items from the character class, and matches only "ab", "ba", "ac", or
"ca", it is equivalent to "{{2,2}a|b|c}" or  "{{2}a|b|c}".

With that extension you can build the necessary regexps to match canonical
equivalent strings with a finite regexp.

2018-01-29 7:16 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:

>
>
> 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:
>
>> On Sun, 28 Jan 2018 20:29:28 +0100
>> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>>
>> > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
>> > unicode at unicode.org>:
>> >
>> > > On Sat, 27 Jan 2018 14:13:40 -0800The theory
>> > > of regular expressions (though you may not think that mathematical
>> > > regular expressions matter) extends to trace monoids, with the
>> > > disturbing exception that the Kleene star of a regular language is
>> > > not necessarily regular.  (The prototypical example is sequences
>> > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are
>> > > canonically equivalent in Unicode terms.  A Unicode example is the
>> > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II -
>> > > there is no FSM that will recognise canonically equivalent strings).
>> > >
>> >
>> > I don't see why you can't write this as the regular expression:
>> >   (x | y)*
>>
>> Because xx does not match.
>>
>> In principle, it can be done iteratively thus:
>>
>> 1) Look for sequences of x's and y's - your (x | y) *
>> 2) Discard matches from (1) where the number of x's and y's are equal.
>>
>> However, the second step cannot be implemented by a *finite* state
>> machine.
>>
>> > For the Unicode canonical equivalences, this applies to distinct
>> > characters that have distinct non-zero combining classes.
>>
>> Those of us who've looked at optimising collation by reducing
>> normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in
>> theory, a source of many problems.
>>
>> > But of course searching for
>> >
>> >   <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
>> >   <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
>> >
>> > requires transforming it to NFD first as:
>>
>> That wasn't I had in mind.  What I had in mind was accepting the
>> propositions that the string <LATIN SMALL LETTER A, COMBINING DOT
>> BELOW, COMBINING  CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH
>> CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW.
>>
>> >   <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
>> >
>> > so thet the regexp transforms to:
>> >
>> >   <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] -
>> > [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> *
>> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT
>> > BELOW> |  <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
>> > BELOW> [[:cc=above:][:cc=below:]]
>> > ]] * < COMBINING  CIRCUMFLEX>
>>
>> If everything is converted to NFD, the regular expressions using traces
>> can be converted to frequently unintelligible regexes on strings; the
>> behaviour of the converted regex when faced with an unnormalised string
>> is of course irrelevant.
>>
>> In the search you have in mind, the converted regex for use with NFD
>> strings is actually intelligible and simple:
>>
>> <LATIN SMALL LETTER A>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>> <COMBINING DOT BELOW>
>> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>> <COMBINING CIRCUMFLEX>
>>
>> Informal notation can simplify the regex still further.
>>
>> There is no upper bound to the length of a string matching that regex,
>>
>
> Wrong, you've not read what followed immediately that commented it
> already: it IS bound exactly because you cannot duplicate the same
> combining class, and there's a known finite number of them for acceptable
> cases: if there's any repetition, it will always be within that bound. But
> it not necessay to expand all the combinations of combining classes to all
> their possible ordering of occurence (something that a classic regexp
> normally requires by expecting a specific order).
>
> One way to solve it would have to have (generic) regexp extension allowing
> to specify a combination of one or more of several items in a choice list
> in any order, but never more than one occurence of each of item. This is
> possible using a rule with boolean flags of presence, one boolean for each
> item in the choice list.
>
> Something like {?a|b|c|d} matching zero or more (or all of them) of
> a,b,c,d (these can be subregexps) in any order, and  {?+a|b|c|d} matching
> one or more, and {?{m,n}a|b|c|d} matching betwen m and n of them (in any
> order in all cases)
> So that {?a|b|c|d}{1,1} is the same as (a|b|c|d) but without the capture,
> i.e.  (?:a|b|c|d), and {?{m,n}a} is the same as a{m,n}, and  {?+a} is the
> same as a, and  {?*a} is the same as a?
>
> Which can also be written respectrively as  {?*[abcd]},  {?+[abcd]} and  {?{m,n}[abcd])
> if the items of the choice list are characters that can be compacted with
> the classic "character class" notation [abcd].
>
> In all these the "{?quantifier list}" notation is always bound by the
> number of items in the list (independantly of the quantifier, and if
> individual items in the list are bound in length, the whole will be bound
> by the sum of their lengths. So even if the quantifier is higher than than
> the number of items in the list, it will be capped:  "{?{1000}a}" will only
> match "a", and  "{?{1000}}" will never match anything (because the list
> is empty: the specified higher bound 1000 is capped to 0 but the specified
> lower bound 1000 is capped to 1 and this is impossible) and is also
> equivalent to {?} where the min-max specified bounds are 1 by default, but
> capped to 1,0
>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180129/3e5bc925/attachment-0001.html>


More information about the Unicode mailing list