Annoyances from Implementation of Canonical Equivalence
Richard Wordingham via Unicode
unicode at unicode.org
Fri Oct 18 07:44:31 CDT 2019
On Fri, 18 Oct 2019 09:45:14 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> > Date: Thu, 17 Oct 2019 21:58:50 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> > > Sounds arbitrary to me. How do we know that all the users will
> > > want that?
> > If the change from codepoint by codepoint matching is just canonical
> > equivalence, then there is no way that the ‘n’ of ‘na’ will be
> > matched by the ‘n’ within ‘ñ’.
> "Just canonical equivalence" is also quite arbitrary, from the user's
> POV. At least IME.
Here's a similar issue. If I do an incremental search in Welsh text,
entering bac (on the way to entering bach) will find words like "bach"
and "bachgen" even though their third letter is 'ch', not 'c'.
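To make the Welsh case concrete, here is a rough Python sketch, not
taken from any real editor. The names WELSH_DIGRAPHS, welsh_letters,
codepoint_prefix_match and letter_prefix_match are hypothetical, and
the greedy digraph split is a simplifying assumption (it would get
words wrong where, say, 'n' and 'g' are separate letters). It only
contrasts matching by code point with matching by Welsh letter, and it
checks prefixes rather than doing a full incremental search.

```python
# Hypothetical illustration only: these names are not from any library.
# Assumption: digraphs are always taken greedily, which is not true of
# every Welsh word, but is enough to show the point.
WELSH_DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def welsh_letters(word):
    """Split a lowercase word into Welsh letters, taking digraphs greedily."""
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in WELSH_DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


def codepoint_prefix_match(typed, word):
    """What a typical incremental search does: compare code points."""
    return word.startswith(typed)


def letter_prefix_match(typed, word):
    """Compare letter by letter, so 'c' does not match the letter 'ch'."""
    typed_letters = welsh_letters(typed)
    return welsh_letters(word)[:len(typed_letters)] == typed_letters


if __name__ == "__main__":
    for word in ("bach", "bachgen"):
        print(word,
              codepoint_prefix_match("bac", word),
              letter_prefix_match("bac", word))
```

With code point matching, the partial entry bac matches both words;
with letter matching it matches neither until the h is typed.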
'Canonical equivalence' is 'DTRT', unless you're working with systems
too lazy or too primitive to DTRT. It involves treating character
sequences declared to be identical in signification as identical.
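As an illustration of the 'n' versus 'ñ' point quoted above, here is a
minimal Python sketch using the standard unicodedata module. The
function names canonically_equivalent and find_canonical are
hypothetical, and the search is a deliberate simplification: it
compares NFD forms and rejects a match that is immediately followed by
a combining mark. A full implementation would also have to deal with
discontiguous matches and with a match that starts in the middle of a
combining sequence.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def find_canonical(needle, haystack):
    """Simplified canonical-equivalence search (hypothetical helper).

    Both strings are decomposed to NFD. A hit is rejected when the next
    character is a combining mark, so 'n' does not match inside 'ñ'.
    """
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # combining() returns 0 for characters that are not combining marks.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return True
        start = h.find(n, start + 1)
    return False


if __name__ == "__main__":
    precomposed = "a\u00f1o"   # 'ñ' as U+00F1
    decomposed = "an\u0303o"   # 'n' followed by U+0303 COMBINING TILDE
    print(canonically_equivalent(precomposed, decomposed))
    # Code point matching depends on how the text happens to be encoded.
    print("n" in precomposed, "n" in decomposed)
    # The canonical search treats both spellings alike.
    print(find_canonical("n", precomposed), find_canonical("n", decomposed))
    print(find_canonical("\u00f1", decomposed))
```

The plain code point test gives different answers for the two
spellings of the same word, whereas the canonical search answers the
same way for both.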
The only pleasant justification for treating canonically equivalent
sequences inequivalently that I can think of is to treat the difference as a way
of recording how the text was typed. Quite a few editing systems erase
that information, and I doubt people care how someone else typed the
text.