Pure Regular Expression Engines and Literal Clusters

Eli Zaretskii via Unicode unicode at unicode.org
Mon Oct 14 13:41:19 CDT 2019

> Date: Mon, 14 Oct 2019 19:29:39 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> On Mon, 14 Oct 2019 10:05:49 +0300
> Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> > I think these are two separate issues: whether search should normalize
> > (a.k.a. perform character folding) should be a user option.  You are
> > talking only about canonical equivalence, but there's also
> > compatibility decomposition, so, for example, searching for "1" should
> > perhaps match ¹ and ①.
> The official position is that text that is canonically
> equivalent is the same.  There are problem areas where traditional
> modes of expression require that canonically equivalent text be treated
> differently.  For these, it is useful to have tools that treat them
> differently.  However, the normal presumption should be that
> canonically equivalent text is the same.

I'm well aware of the official position.  However, when we attempted
to implement it unconditionally in Emacs, some people objected, and
brought up good reasons.  You can, of course, elect to disregard this
experience, and instead learn it from your own.

> The party line seems to be that most searching should actually be done
> using a 'collation', which brings with it different levels of
> 'folding'.  In multilingual use, a collation used for searching should
> be quite different to one used for sorting.

Alas, collation is locale- and language-dependent.  And, if you are
going to use your search in a multilingual application (Emacs is such
an application), you will have a hard time even knowing which tailoring
to apply for each potential match, because you will need to support
the use case of working with text that mixes languages.
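
To make the tailoring problem concrete, here is a toy sketch in Python.
It is not a real collation implementation: the DISTINCT_LETTERS table
and the function names are made up for illustration, and the table only
encodes the assumption that Swedish treats å, ä and ö as letters in
their own right while a German search folds ä to a.

```python
import unicodedata

# Hypothetical, heavily simplified "tailorings": the letters each
# language treats as distinct base letters rather than accented variants.
DISTINCT_LETTERS = {
    "de": set(),
    "sv": {"å", "ä", "ö"},
}

def primary_key(text, lang):
    """Fold case and strip combining marks, except on letters that the
    given language considers distinct."""
    out = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in DISTINCT_LETTERS[lang]:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def matches(haystack, needle, lang):
    """Naive substring search on the folded keys."""
    return primary_key(needle, lang) in primary_key(haystack, lang)

print(matches("Bär", "bar", "de"))  # folded under the toy German rules
print(matches("Bär", "bar", "sv"))  # kept distinct under the toy Swedish rules
```

The same buffer text gives a different answer depending on which
language you assume, and in a buffer that mixes languages nothing tells
the search which table applies at a given position.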

Leaving the conundrum to the user to resolve seems to be a good
compromise, and might actually teach us something that is useful for
future modifications of the "party line".
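
As a minimal sketch of what such a user option could look like (this is
Python using the standard unicodedata module, not what Emacs does; the
names FOLDING_FORMS, fold and find are invented for the example):

```python
import unicodedata

# User-selectable folding level mapped to a Unicode normalization form.
FOLDING_FORMS = {
    "none": None,             # compare code points as they are
    "canonical": "NFD",       # canonical equivalence only
    "compatibility": "NFKD",  # also fold compatibility variants
}

def fold(text, level):
    form = FOLDING_FORMS[level]
    return text if form is None else unicodedata.normalize(form, text)

def find(haystack, needle, level="canonical"):
    """Naive substring search after normalizing both strings.
    Caveat: on decomposed text, a bare base letter also matches the
    start of a base-plus-combining-mark sequence."""
    return fold(needle, level) in fold(haystack, level)

precomposed = "caf\u00e9"   # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(find(precomposed, decomposed, "none"))
print(find(precomposed, decomposed, "canonical"))
print(find("x\u00b9", "1", "canonical"))      # SUPERSCRIPT ONE
print(find("x\u00b9", "1", "compatibility"))
print(find("\u2460", "1", "compatibility"))   # CIRCLED DIGIT ONE
```

With "canonical" the two spellings of "café" match each other, but "1"
matches ¹ and ① only at the "compatibility" level, which is the
distinction between the two kinds of folding discussed above.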
