Internationalised Computer Science Exercises

Sun Jan 28 16:44:56 CST 2018

On Sun, 28 Jan 2018 20:29:28 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  
> 
> > On Sat, 27 Jan 2018 14:13:40 -0800The theory
> > of regular expressions (though you may not think that mathematical
> > regular expressions matter) extends to trace monoids, with the
> > disturbing exception that the Kleene star of a regular language is
> > not necessarily regular.  (The prototypical example is sequences
> > (xy)^n where x and y are distinct and commute, i.e. xy and yx are
> > canonically equivalent in Unicode terms.  A Unicode example is the
> > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II -
> > there is no FSM that will recognise canonically equivalent strings).
> >  
> 
> I don't see why you can't write this as the regular expression:
>   (x | y)*

Because xx does not match.

In principle, it can be done iteratively thus:

1) Look for sequences of x's and y's - your (x | y) *
2) Discard matches from (1) where the number of x's and y's are equal.

However, the second step cannot be implemented by a *finite* state
machine.

> For the Unicode canonical equivalences, this applies to distinct
> characters that have distinct non-zero combining classes.

Those of us who've looked at optimising collation by reducing
normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in
theory, a source of many problems.

> But of course searching for
> 
>   <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or
>   <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX >
> 
> requires transforming it to NFD first as:

That wasn't I had in mind.  What I had in mind was accepting the
propositions that the string <LATIN SMALL LETTER A, COMBINING DOT
BELOW, COMBINING  CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH
CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW.

>   <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
> 
> so thet the regexp transforms to:
> 
>   <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] -
> [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> *
> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT
> BELOW> |  <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] -
> BELOW> [[:cc=above:][:cc=below:]]
> ]] * < COMBINING  CIRCUMFLEX>

If everything is converted to NFD, the regular expressions using traces
can be converted to frequently unintelligible regexes on strings; the
behaviour of the converted regex when faced with an unnormalised string
is of course irrelevant.

In the search you have in mind, the converted regex for use with NFD
strings is actually intelligible and simple:

<LATIN SMALL LETTER A>
[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * 
<COMBINING DOT BELOW>
[[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * 
<COMBINING CIRCUMFLEX>

Informal notation can simplify the regex still further.

There is no upper bound to the length of a string matching that regex,
though examples in correctly spelt natural languages are quite limited
in length.

Of course, what one is interested in is the input form of the match to
<LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>.
That can be in three parts, and some of the parts may contain parts of
composed characters.  There isn't a widely used notation for such
discontiguous, character-splitting substrings.

What can get nasty is NFD regexes for things like 

[[:InPC=Top:]] [[:InPC=Bottom:]]

You don't want to craft these by hand.  You just want it to match
<U+0F73 TIBETAN VOWEL SIGN II> and its canonical equivalents, including
the NFD form:

 <U+0F71 TIBETAN VOWEL SIGN AA, U+0F72 TIBETAN VOWEL SIGN I>

This is a bottom vowel followed by a top vowel.

Richard.