Internationalised Computer Science Exercises

Philippe Verdy via Unicode unicode at unicode.org
Mon Jan 29 07:15:04 CST 2018


No, since the beginning we were talking about matching strings that are
canonically equivalent within regexps, so that searching for a regexp
containing precomposed or decomposed characters would find them
independently of the encoded form (normalized or not) of the input, and
independently of any additional combining characters inserted between
them.
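
As a minimal illustration of "independently of the encoded form", here is
a Python sketch using only the standard unicodedata module. The helper
name canonically_equivalent is hypothetical (it is not part of any
library); it simply compares the NFD forms of two strings, which is one
common way to test canonical equivalence.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    exactly when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Three encodings of "u with diaeresis and macron".
precomposed = "\u01d6"        # U+01D6 as a single code point
partial = "\u00fc\u0304"      # U+00FC (u with diaeresis) + COMBINING MACRON
decomposed = "u\u0308\u0304"  # u + COMBINING DIAERESIS + COMBINING MACRON

print(canonically_equivalent(precomposed, partial))
print(canonically_equivalent(precomposed, decomposed))
```

A search that is supposed to honour canonical equivalence has to treat
all three of these inputs the same way.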

The case of u with diaeresis and macron is simpler: it has two combining
characters of the same combining class and they don't commute. Still, the
regexp to match it is something like:

U [[:cc>0:]-[:cc=above:]]* <DIAERESIS> [[:cc>0:]-[:cc=above:]]* <MACRON>
[[:cc>0:]-[:cc=above:]]*

The source is simply decomposed (it does not need to be normalized to NFD)
and matched according to this transformed regexp, which does not need the
"{exclusive choice list}" notation here because DIAERESIS and MACRON do
not commute.
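
A rough Python sketch of that transformed regexp follows. It is only an
approximation under stated assumptions: Python's re module has no
[:cc=...:] property syntax, so the class [[:cc>0:]-[:cc=above:]] is
emulated by enumerating every code point whose canonical combining class
is neither 0 nor 230 (Above). The names OTHER_MARKS, decompose, PATTERN
and find are hypothetical, and the base letter is taken to be lowercase
"u". The source is decomposed one character at a time, so combining marks
are never reordered across the original characters.

```python
import re
import sys
import unicodedata

ABOVE = 230  # canonical combining class "Above"

# Emulates [[:cc>0:]-[:cc=above:]]: every combining mark whose class is
# neither 0 nor Above, enumerated from the Unicode data shipped with Python.
OTHER_MARKS = "[" + "".join(
    re.escape(chr(cp))
    for cp in range(sys.maxunicode + 1)
    if unicodedata.combining(chr(cp)) not in (0, ABOVE)
) + "]"

DIAERESIS = "\u0308"  # COMBINING DIAERESIS
MACRON = "\u0304"     # COMBINING MACRON

# u OTHER* <DIAERESIS> OTHER* <MACRON> OTHER*
PATTERN = re.compile(
    "u" + OTHER_MARKS + "*" + DIAERESIS + OTHER_MARKS + "*" + MACRON
    + OTHER_MARKS + "*"
)


def decompose(text: str) -> str:
    """Canonically decompose each character on its own, so that marks
    are not reordered across the original characters."""
    return "".join(unicodedata.normalize("NFD", ch) for ch in text)


def find(text: str):
    """Search the decomposed (but not NFD-normalized) source."""
    return PATTERN.search(decompose(text))
```

With this sketch, find("\u01d6") and find("u\u0308\u0323\u0304") (a
COMBINING DOT BELOW typed between the two marks) both succeed, while
find("u\u0304\u0308") (macron before diaeresis) does not, because the two
marks share combining class 230 and therefore do not commute.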



2018-01-29 9:57 GMT+01:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

> On Mon, 29 Jan 2018 07:16:04 +0100
> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>
> > 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <
> > unicode at unicode.org>:
>
> > > In the search you have in mind, the converted regex for use with NFD
> > > strings is actually intelligible and simple:
> > >
> > > <LATIN SMALL LETTER A>
> > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > > <COMBINING DOT BELOW>
> > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > > <COMBINING CIRCUMFLEX>
> > >
> > > Informal notation can simplify the regex still further.
> > >
> > > There is no upper bound to the length of a string matching that
> > > regex,
> >
> > Wrong, you've not read what followed immediately that commented it
> > already: it IS bound exactly because you cannot duplicate the same
> > combining class, and there's a known finite number of them for
> > acceptable cases: if there's any repetition, it will always be within
> > that bound.
>
> Are you talking about regular expressions or strings that match them?
> Natural language text can very easily contain adjacent combining
> characters of the same combining class - look no further than the
> full decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND
> MACRON.  For a few combining characters, such as U+1A7F TAI THAM
> COMBINING CRYPTOGRAMMIC DOT, repetition is of their very essence.
> One can find pairs of combining circumflexes in plain text maths.
>
> Incidentally, I was talking about regular expressions, which imply
> *finite* state machines, albeit huge, rather than 'regexes', which are
> similar but may formally require unbounded memory.
>
> Richard.
>