Internationalised Computer Science Exercises

Richard Wordingham via Unicode unicode at unicode.org
Mon Jan 29 02:57:41 CST 2018


On Mon, 29 Jan 2018 07:16:04 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode
> <unicode at unicode.org>:

> > In the search you have in mind, the converted regex for use with NFD
> > strings is actually intelligible and simple:
> >
> > <LATIN SMALL LETTER A>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING DOT BELOW>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING CIRCUMFLEX>
> >
> > Informal notation can simplify the regex still further.
> >
> > There is no upper bound to the length of a string matching that
> > regex, 
> 
> Wrong, you've not read what followed immediately that commented it
> already: it IS bound exactly because you cannot duplicate the same
> combining class, and there's a known finite number of them for
> acceptable cases: if there's any repetition, it will always be within
> that bound.

Are you talking about regular expressions or strings that match them?
Natural language text can very easily contain adjacent combining
characters of the same combining class - look no further than the
full decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND
MACRON.  For a few combining characters, such as U+1A7F TAI THAM
COMBINING CRYPTOGRAMMIC DOT, repetition is of their very essence.
One can find pairs of combining circumflexes in plain text maths.
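
The U+01D6 case is easy to check. The following is only a minimal
sketch, assuming Python's standard unicodedata module; the helper name
nfd_with_classes is made up for the illustration. It lists each code
point of the full decomposition with its canonical combining class.

```python
import unicodedata

def nfd_with_classes(text):
    """Return (code point label, canonical combining class) pairs for NFD(text)."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in decomposed]

# U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON
pairs = nfd_with_classes("\u01D6")
for label, ccc in pairs:
    print(label, ccc)
```

The diaeresis (U+0308) and the macron (U+0304) both come out with
combining class 230, so the two adjacent marks share a class.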

Incidentally, I was talking about regular expressions, which imply
*finite* state machines, albeit huge, rather than 'regexes', which are
similar but may formally require unbounded memory.
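
To make the point about unbounded matches concrete, here is a rough
sketch in Python of a hand-written matcher for the quoted pattern. It is
an illustration under stated assumptions, not the implementation of any
regex engine: the names is_filler, matches and sample are invented, the
match is anchored at both ends, and U+0327 COMBINING CEDILLA (combining
class 202) is simply one convenient mark that is neither above nor
below. The scan keeps only a position, so its state is fixed while the
matched string can be arbitrarily long.

```python
import unicodedata

ABOVE, BELOW = 230, 220            # canonical combining classes
DOT_BELOW, CIRCUMFLEX = "\u0323", "\u0302"

def is_filler(ch):
    """True for a mark with non-zero class that is neither above nor below."""
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc not in (ABOVE, BELOW)

def matches(s):
    """Anchored match of: 'a' filler* DOT_BELOW filler* CIRCUMFLEX."""
    if not s.startswith("a"):
        return False
    i = 1
    while i < len(s) and is_filler(s[i]):   # first filler*
        i += 1
    if i >= len(s) or s[i] != DOT_BELOW:
        return False
    i += 1
    while i < len(s) and is_filler(s[i]):   # second filler*
        i += 1
    return i + 1 == len(s) and s[i] == CIRCUMFLEX

def sample(n):
    """An NFD string with n cedillas between the base letter and the dot below."""
    return "a" + "\u0327" * n + DOT_BELOW + CIRCUMFLEX

for n in (0, 1, 50):
    s = sample(n)
    print(n, len(s), matches(s), unicodedata.normalize("NFD", s) == s)
```

Each sample is already in NFD and matches, whatever the value of n, so
the length of a matching string has no upper bound even though the
expression itself is finite.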

Richard.

