Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at unicode.org
Sun Jan 28 13:45:36 CST 2018


Note that for finding occurence of simpler combining sequences such
as  finding <LATIN SMALL LETTER A WITH CIRCUMFLEX> the regexp is simpler:
<LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * <COMBINING
CIRCUMFLEX>

The central character class allows 53 distinct combining classes, and the
maximum match length is 2+53=55 characters.

If Unicode assigns new combining classes for new combining characters, the
maximum match length will increase by 1 character for this regexp, and by 2
characters in the previous example.
As there can be at most 255 non-zero combining classes (due to current
stability rules), finding  <LATIN SMALL LETTER A WITH CIRCUMFLEX> will
match at most 1+253+1 = 255 characters in any future version of Unicode,
and finding  <LATIN SMALL LETTER A WITH CIRCUMFLEX,DOT BELOW>  will match
at most 1+252+1+252+1 = 507 characters.

This is still finite, small enough to be implementable with a deterministic
FSM, using no more than 1 codepoint of lookahead, without using any
backtrailing.


2018-01-28 20:30 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> Typo, the full regexp has undesired asterisks:
>
>   <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
>   ( <COMBINING CIRCUMFLEX> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]
> ]] * <COMBINING DOT BELOW>
>   |  <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]
> ]] * < COMBINING  CIRCUMFLEX>
>
>
>
> 2018-01-28 20:29 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:
>
>>
>>
>> 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
>> unicode at unicode.org>:
>>
>>> On Sat, 27 Jan 2018 14:13:40 -0800The theory
>>> of regular expressions (though you may not think that mathematical
>>> regular expressions matter) extends to trace monoids, with the
>>> disturbing exception that the Kleene star of a regular language is not
>>> necessarily regular.  (The prototypical example is sequences (xy)^n
>>> where x and y are distinct and commute, i.e. xy and yx are canonically
>>> equivalent in Unicode terms.  A Unicode example is the set of strings
>>> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that
>>> will recognise canonically equivalent strings).
>>>
>>
>> I don't see why you can't write this as the regular expression:
>>   (x | y)*
>> For the Unicode canonical equivalences, this applies to distinct
>> characters that have distinct non-zero combining classes.
>>
>> But of course searching for
>>
>>   <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
>>   <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
>>
>> requires transforming it to NFD first as:
>>
>>   <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
>>
>> so thet the regexp transforms to:
>>
>>   <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]
>> *
>>   ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] -
>> [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW>
>>   |  <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
>> [[:cc=above:][:cc=below:]] ]] * < COMBINING  CIRCUMFLEX>
>>
>> Note that the "complex" set of characters used three times above is
>> finite, it contains all combining characters of Unicode that have a
>> non-zero combining class except above and below, i.e. all Unicode
>> characters whose combining class is not 0, 220 (below) or 230 (above).
>>
>> However, It is too simplified, because the allowed combining classes must
>> occur at most once in each possible non-zero combining class and not
>> arbitrary numbers of them: these allowed combining classs currently are in
>> {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218,
>> 222, 224, 226, 228, 232..234, 240} whose most member elements are used for
>> very few combining characters (the above and below combining classes are
>> the most populated ones but we exclude them, all the others have 1 to 9
>> combining characters assigned to them, or 22 characters with cc=7 (nukta),
>> or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama).
>>
>> Once again we can refine them also as a regexp, but this is combinatorial
>> because we have 52 combining classes (so we would need to consider the 52!
>> (factorial) alternates). But given the maximum length of what this can
>> match (0 to 52 combining characters: yes it is finite), this is best done
>> by not rewriting this as a regexp but by replacing the final "*" by {1,52},
>> and then just check each returned match of
>>
>>   [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52}
>>
>> with a simple scan of these short strings to see that they all have
>> distinct combining classes (this just requires 52 booleans, easily stored
>> in a single 64 bit integer initialized to 0 prior to scan the scan of these
>> small strings). But the theory does not prevent writing it as a regexp
>> (even if it would be extremely long). So a  Kleene Star closure is
>> possible and can be efficiently implemented (all depends on the way  you
>> represent the "current state" in the FSM: a single integer representing a
>> single node number in the traversal graph is not the best way to do that.
>>
>> This is a valid regexp, the finite state machine DOES have a finite
>> lookahead (the full regexp above will match AT MOST 107 characters
>> including the combining marks, where 107=3+2*52), but this is general to
>> regexps that generally cannot be transformed directly into a FSM with
>> finite lookahead, but a FSM is possible: the regexp first transforms into a
>> simple graph of transitions with a finite number of node (this number is
>> bound to the length of the regexp itself) where there can be multiple
>> states active simultaneously; then a basic transform converts this small
>> graph by transforming nodes into new nodes representing the finite set of
>> the combinations of active states in the first graph :
>>
>> There will be many more nodes, and generally this explodes in size
>> because the transform is combinatorial, and such size explosion has worst
>> perfomance (explosion of the memory needed to representing the new graph
>> with a single state active). So regexp engines use the alternative by
>> representing the current state of traversal of the first simple graph using
>> a stack of active states and transiting them all separately (this can be
>> implemented with a "bitset" whose size in bits is the number of states in
>> the first simple graph, or by using an associative array (dictionnary of
>> boolean properties whose keys are state numbers in the first graph, which
>> can be set  or removed: this requires much less memory and it is relatively
>> fast, even if the full current state is not just a single small integer.
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180128/2edf282e/attachment.html>


More information about the Unicode mailing list