Annoyances from Implementation of Canonical Equivalence

Richard Wordingham via Unicode unicode at unicode.org
Fri Oct 18 07:44:31 CDT 2019


On Fri, 18 Oct 2019 09:45:14 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Thu, 17 Oct 2019 21:58:50 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> >   
> > > Sounds arbitrary to me.  How do we know that all the users will
> > > want that?  
> > 
> > If the change from codepoint by codepoint matching is just canonical
> > equivalence, then there is no way that the ‘n’ of ‘na’ will be
> > matched by the ‘n’ within ‘ñ’.  
> 
> "Just canonical equivalence" is also quite arbitrary, for the user's
> POV.  At least IME.

Here's a similar issue.  If I do an incremental search in Welsh text,
entering "bac" (on the way to entering "bach") will find words like
"bach" and "bachgen" even though their third letter is 'ch', not 'c'.
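
To make the Welsh case concrete, here is a rough Python sketch that
contrasts a code point prefix test with a letter-by-letter one.  The
names welsh_letters and letter_prefix are made up for illustration,
and the greedy digraph split is only an assumption: it is wrong for
words where 'n' and 'g' are separate letters, as in "Bangor".

```python
# Welsh digraphs that count as single letters of the alphabet.
WELSH_DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")

def welsh_letters(word):
    """Split a word into Welsh letters, greedily pairing up digraphs.

    Illustrative only: a real implementation needs a dictionary or
    tailored collation data to resolve cases such as n+g versus ng.
    """
    lowered = word.lower()
    letters = []
    i = 0
    while i < len(lowered):
        pair = lowered[i:i + 2]
        if pair in WELSH_DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(lowered[i])
            i += 1
    return letters

def letter_prefix(typed, word):
    """True if the letters of `typed` are a prefix of the letters of `word`."""
    t = welsh_letters(typed)
    return welsh_letters(word)[:len(t)] == t

# Code point matching: "bac" is found at the start of both words.
codepoint_hits = [w for w in ("bach", "bachgen") if w.startswith("bac")]

# Letter matching: the third letter is 'ch', so "bac" matches neither.
letter_hits = [w for w in ("bach", "bachgen") if letter_prefix("bac", w)]
```

A search that insisted on whole Welsh letters would drop both words
at "bac" and find them again at "bach", which is probably more
surprising to the user than the code point behaviour.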

'Canonical equivalence' is 'DTRT', unless you're working with systems
too lazy or too primitive to DTRT.  It means treating character
sequences that are declared to be identical in signification
identically.
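
As a minimal sketch of what that means in practice, the Python below
compares two strings after normalising both to NFD with the standard
unicodedata module.  The helper name canonically_equivalent is
invented here, and the sketch only tests whole strings; it does not
attempt canonical-equivalent searching within text.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "pi\u00F1a"    # ñ as U+00F1
decomposed = "pin\u0303a"    # n followed by U+0303 COMBINING TILDE

# The code point sequences differ, but they are canonically equivalent.
same_codepoints = precomposed == decomposed
same_text = canonically_equivalent(precomposed, decomposed)

# Code point matching treats the two spellings differently: a bare 'n'
# is found inside the decomposed form but not inside the precomposed one.
n_in_precomposed = "n" in precomposed
n_in_decomposed = "n" in decomposed
```

With code point matching the result of searching for 'n' depends on
which of the two equivalent spellings the text happens to use; under
canonical equivalence both would have to give the same answer.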

The only pleasant justification for treating canonically equivalent
sequences differently that I can think of is to treat the difference
as a way of recording how the text was typed.  Quite a few editing
systems erase that information, and I doubt people care how someone
else typed the text.

Richard.


