Annoyances from Implementation of Canonical Equivalence

Richard Wordingham via Unicode unicode at
Fri Oct 18 07:44:31 CDT 2019

On Fri, 18 Oct 2019 09:45:14 +0300
Eli Zaretskii via Unicode <unicode at> wrote:

> > Date: Thu, 17 Oct 2019 21:58:50 +0100
> > From: Richard Wordingham via Unicode <unicode at>
> >   
> > > Sounds arbitrary to me.  How do we know that all the users will
> > > want that?  
> > 
> > If the change from codepoint by codepoint matching is just canonical
> > equivalence, then there is no way that the ‘n’ of ‘na’ will be
> > matched by the ‘n’ within ‘ñ’.  
> "Just canonical equivalence" is also quite arbitrary, for the user's
> POV.  At least IME.

Here's a similar issue.  If I do an incremental search in Welsh text,
entering bac (on the way to entering bach) will find words like "bach"
and "bachgen" even though their third letter is 'ch', not 'c'.
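
A minimal sketch of the Welsh case, for illustration only.  The helper
names are hypothetical, and the greedy digraph split is an assumption
that is wrong for some words (for instance where 'n' and 'g' are
separate letters), so treat it as a toy rather than a real Welsh
tokenizer.

```python
# Digraphs that count as single letters in the Welsh alphabet.
WELSH_DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def welsh_letters(word):
    """Split a word into Welsh letters, greedily pairing up digraphs.

    Assumption: every occurrence of a digraph spelling is one letter.
    """
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in WELSH_DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


def codepoint_prefix_match(typed, word):
    """What a plain incremental search does: compare code points."""
    return word.startswith(typed)


def letter_prefix_match(typed, word):
    """Compare whole Welsh letters instead of code points."""
    typed_letters = welsh_letters(typed)
    return welsh_letters(word)[:len(typed_letters)] == typed_letters


for typed in ("bac", "bach"):
    for word in ("bach", "bachgen"):
        print(typed, word,
              codepoint_prefix_match(typed, word),
              letter_prefix_match(typed, word))
```

With code point matching, "bac" already matches both words; with
letter matching, nothing matches until the 'h' has been typed.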

'Canonical equivalence' is 'DTRT', unless you're working with systems
too lazy or too primitive to DTRT.  It involves treating character
sequences declared to be identical in signification as identical.
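
As a rough sketch of what that means in practice, using Python's
standard unicodedata module: two strings are canonically equivalent if
their NFD forms are equal, and a search that respects this should not
match a base letter while leaving its combining mark behind.  The
function names are hypothetical, and the search is a simplification: it
only checks the boundary after the match and does not handle matches
that need combining marks to be reordered.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def find_canonical(needle, haystack):
    """Return offsets in NFD(haystack) where NFD(needle) matches.

    A match is rejected if the next character is a combining mark
    (non-zero canonical combining class), because accepting it would
    split a base letter from its mark.
    """
    n = unicodedata.normalize("NFD", needle)
    h = unicodedata.normalize("NFD", haystack)
    hits = []
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            hits.append(start)
        start = h.find(n, start + 1)
    return hits


precomposed = "ma\u00f1ana"   # 'ñ' as U+00F1
decomposed = "man\u0303ana"   # 'n' followed by U+0303 COMBINING TILDE

print(canonically_equivalent(precomposed, decomposed))
print(decomposed.find("n"))               # code point search hits the 'n' in 'ñ'
print(find_canonical("na", precomposed))  # only the final "na"
print(find_canonical("\u00f1", decomposed))
```

Both spellings of the word give the same results from find_canonical,
which is the point: the 'n' of "na" is never matched by the 'n' inside
'ñ', however the 'ñ' was encoded.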

The only pleasant justification for treating canonical sequences
inequivalently that I can think of is to treat the difference as a way
of recording how the text was typed.  Quite a few editing systems erase
that information, and I doubt people care how someone else typed the
text.

