Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Mon Oct 14 13:29:39 CDT 2019

On Mon, 14 Oct 2019 10:05:49 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Mon, 14 Oct 2019 01:10:45 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> > They hadn't given any thought to [\p{L}&&\p{isNFD}]\p{gcb=extend}*,
> > and were expecting normalisation (even to NFC) to be a possible
> > cure.  They had begun to realise that converting expressions to
> > match all or none of a set of canonical equivalents was hard; the
> > issue of non-contiguous matches wasn't mentioned.  

> I think these are two separate issues: whether search should normalize
> (a.k.a. perform character folding) should be a user option.  You are
> talking only about canonical equivalence, but there's also
> compatibility decomposition, so, for example, searching for "1" should
> perhaps match ¹ and ①.


The official position is that text that is canonically
equivalent is the same.  There are problem areas where traditional
modes of expression require that canonically equivalent text be treated
differently.  For these, it is useful to have tools that treat them
differently.  However, the normal presumption should be that
canonically equivalent text is the same.
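The distinction between canonical and compatibility equivalence can be illustrated with Python's standard `unicodedata` module (my example, not part of the original post):

```python
import unicodedata

# "é" as one precomposed code point vs. "e" + combining acute accent:
# different code point sequences, canonically equivalent text.
nfc = "\u00e9"       # é (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
nfd = "e\u0301"      # e + U+0301 COMBINING ACUTE ACCENT
assert nfc != nfd                                  # raw comparison differs
assert unicodedata.normalize("NFD", nfc) == nfd    # equal once normalised

# Compatibility equivalence is a separate, stronger folding:
# NFKC maps superscript one and circled one to a plain "1".
assert unicodedata.normalize("NFKC", "\u00b9") == "1"   # ¹
assert unicodedata.normalize("NFKC", "\u2460") == "1"   # ①
```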

The party line seems to be that most searching should actually be done
using a 'collation', which brings with it different levels of
'folding'.  In multilingual use, a collation used for searching should
be quite different to one used for sorting.
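A real search collation would come from ICU, but the flavour of a low-strength "folding" can be sketched in a few lines; this is only a crude approximation of primary-strength matching, not a substitute for a collator:

```python
import unicodedata

def search_fold(s: str) -> str:
    """Crude approximation of a primary-strength search collation:
    compatibility-decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.casefold()

# Accents and case are ignored at this folding level.
assert search_fold("R\u00e9sum\u00e9") == search_fold("resume")
```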

Now, there is a case for being able to switch off normalisation and
canonical equivalence generally, e.g. when dealing with ISO 10646 text
instead of Unicode text.  This of course still leaves the question of
what character classes defined by Unicode properties then mean.

If one converts the regular expression so that what it matches is
closed under canonical equivalence, then visibly normalising the
searched text becomes irrelevant.  For working with Unicode traces, I
actually do both.  I convert the text to NFD but report matches in terms
of the original code point sequence; working this way simplifies the
conversion of the regular expression, which I do as part of its
compilation.  For traces, it seems only natural to treat precomposed
characters as syntactic sugar for their NFD decompositions.  (They
have no place in the formal theory of traces.)  However, I go further
and convert the decomposed text all the way to NFD: decomposition alone
does not suffice, because conversion to NFD can also change the stored
order of combining marks.
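That reordering is easy to demonstrate (again with Python's `unicodedata`): the string below is already fully decomposed, yet it is not in NFD, because canonical ordering puts the lower combining class first.

```python
import unicodedata

# U+0301 COMBINING ACUTE ACCENT has ccc 230;
# U+0323 COMBINING DOT BELOW has ccc 220.
s = "a\u0301\u0323"                    # decomposed, but marks out of order
nfd = unicodedata.normalize("NFD", s)
assert nfd == "a\u0323\u0301"          # NFD swaps the two marks
assert [unicodedata.combining(c) for c in nfd] == [0, 220, 230]
```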

One of the simplifications I get is that literal runs of text in the
regular expression then match simply by converting both the run and the
searched string.
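For example (my illustration, assuming a plain code-point-level engine): a literal pattern in NFC fails against decomposed text until both sides are converted to NFD.

```python
import re
import unicodedata

pattern = "caf\u00e9"            # "café" with precomposed é
text = "un cafe\u0301 noir"      # the same word, decomposed in the text

assert re.search(pattern, text) is None    # code-point match fails

# Converting both the literal run and the searched string to NFD
# makes the match succeed at the code-point level.
assert re.search(unicodedata.normalize("NFD", pattern),
                 unicodedata.normalize("NFD", text)) is not None
```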

For the concatenation of expressions A and B, once I am looking at the
possible interleaving of two traces, I am dealing with NFA states of
the form states(A) × {1..254} × states(B), so that for an element (a,
n, b), a corresponds to starts of words with a match in A, b
corresponds to starts of words with a match in B, and n is the ccc
of the last character used to advance to b.  The element n blocks
non-starters that can't belong to a word matching A.  If I didn't
(internally) convert the searched text to NFD, the element n would have
to be a set of blocked canonical combining classes, changing the number
of possible values from 54 to 2^54 - 1.
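The trace-theoretic view rests on the independence relation behind canonical reordering: UAX #15's Canonical Ordering Algorithm swaps adjacent characters when ccc(A) > ccc(B) > 0, so two characters may be reordered without changing canonical equivalence exactly when both are non-starters with distinct combining classes. A minimal sketch of that relation (my code, not the poster's engine):

```python
import unicodedata

def commutes(a: str, b: str) -> bool:
    """True iff adjacent characters a, b are independent, i.e. may be
    reordered without changing canonical equivalence: both must be
    non-starters with distinct canonical combining classes."""
    ca, cb = unicodedata.combining(a), unicodedata.combining(b)
    return ca != 0 and cb != 0 and ca != cb

# Acute (ccc 230) and dot below (ccc 220) commute...
assert commutes("\u0301", "\u0323")
# ...but a mark never commutes past a starter,
assert not commutes("a", "\u0301")
# nor past another mark of the same class (two ccc-230 accents).
assert not commutes("\u0301", "\u0300")
```

A starter, or a mark of the same class, "blocks" reordering; this is the role the ccc element n plays in the (a, n, b) states above.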

While aficionados of regular languages may object that converting the
searched text to NFD is cheating, there is a theorem that if I have a
finite automaton that recognises a family of NFD strings, there is
another finite automaton that will recognise all their canonical
equivalents.
