Annoyances from Implementation of Canonical Equivalence

Richard Wordingham via Unicode unicode at unicode.org
Wed Oct 16 20:26:35 CDT 2019


On Wed, 16 Oct 2019 09:33:38 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > These are complaints about primary-level searches, not canonical
> > equivalence.  
> 
> Not sure what you call primary-level searches, but if you deduced the
> complaints were only about searches for base characters, then that's
> not so.  They are long discussions with many sub-threads, so it might
> be hard to find the specific details you are looking for.

The nearest I've found to complaints about including canonical
equivalences are:

(a) an observation that very occasionally one would need to switch
canonical equivalence off.  In such cases, one is not concerned with
the text as such, but rather with how Unicode non-compliant processes
will handle it.  Compliant processes are often built out of
non-compliant processes.

(b) just possibly

"What we have seen is that the behavior that comes from that Unicode
data does not please the users very much.  Users seem to have many
different ideas of what folding is useful, and disagree with each
other greatly." -
https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg01359.html

I can't tell what (b) was talking about; it may well have been about
folding or asymmetric search, as opposed to supporting canonical
equivalence.

(c) A search for 'n' finding 'ñ'.

When it comes to canonical equivalence, one answer to (c) is that as
soon as one adds the next letter, e.g. 'na', the search will no
longer match 'ñ'.  (This doesn't apply to diacritic-ignoring folding.)
That argument doesn't work with the Polish letter 'ń' though, as it can
be word-final.
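
To make (c) concrete, here is a minimal Python sketch.  It assumes the
search is nothing more than a substring match over NFD-normalised
strings; that is my simplification for illustration, not how any
particular editor does it, and the helper name naive_canonical_find is
made up.

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def naive_canonical_find(text, pattern):
    """Hypothetical helper: substring search after NFD normalisation.

    Precomposed and decomposed spellings compare equal, but a bare
    base letter still matches the start of base + combining mark.
    """
    return nfd(pattern) in nfd(text)


# 'n' is found inside 'ñ', because NFD('ñ') is 'n' + U+0303.
print(naive_canonical_find("pi\u00f1a", "n"))    # True
# Adding the next letter stops the match: U+0303 sits between 'n' and 'a'.
print(naive_canonical_find("pi\u00f1a", "na"))   # False
# Word-final Polish 'ń' has no next letter to add, so 'n' still matches.
print(naive_canonical_find("ko\u0144", "n"))     # True
# Canonical equivalence itself: decomposed pattern, precomposed text.
print(naive_canonical_find("ko\u0144", "n\u0301"))  # True
```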

In programming, one might be able to prevent the issue by using
'n\b{g}', but that is a requirement of RL2.2, which doesn't seem to be
high on the list of implementers' priorities, especially as it depends
on properties outwith the UCD, defined in a non-ASCII file to boot.  A
better supported solution is probably 'n\P{Mn}'.
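
Python's standard 're' module has no \p{..} property classes, so the
sketch below emulates the idea by hand with unicodedata.category().
Note that it behaves like a negative lookahead, i.e. "n not followed by
a nonspacing mark", whereas a literal 'n\P{Mn}' would also consume the
following character and so needs one to be present.  The function name
find_unmarked is made up, and working on NFD text is my assumption.

```python
import unicodedata


def find_unmarked(text, base):
    """Hypothetical helper: indices in NFD(text) where `base` occurs
    and is not immediately followed by a nonspacing mark (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(decomposed):
        if ch != base:
            continue
        following = decomposed[i + 1:i + 2]  # empty string at end of text
        if following and unicodedata.category(following) == "Mn":
            continue  # base letter carries a mark, e.g. 'ń' or 'ñ'
        hits.append(i)
    return hits


# NFD("koń i noc") is "kon\u0301 i noc"; only the 'n' of "noc" is bare.
print(find_unmarked("ko\u0144 i noc", "n"))  # [7]
# A word-final plain 'n' at the end of the text is still found.
print(find_unmarked("pan", "n"))             # [2]
```

The indices refer to the decomposed string, not the original, so a real
implementation would have to map them back.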

In many cases, the answer might be a search by collation graphemes, but
that has other issues besides language sensitivity.

Richard.


