Annoyances from Implementation of Canonical Equivalence

Richard Wordingham via Unicode unicode at unicode.org
Thu Oct 17 15:58:50 CDT 2019


On Thu, 17 Oct 2019 10:42:19 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Thu, 17 Oct 2019 02:26:35 +0100
> > From: Richard Wordingham <richard.wordingham at ntlworld.com>
> > Cc: Eli Zaretskii <eliz at gnu.org>
> > 
> > (c) A search for 'n' finding 'ñ'.
> > 
> > When it comes to canonical equivalence, one answer to (c) is that as
> > soon as one adds the next letter, e.g. 'na', the search will
> > no longer match 'ñ'.  
> 
> Sounds arbitrary to me.  How do we know that all the users will want
> that?

If the only change from codepoint-by-codepoint matching is support for
canonical equivalence, then there is no way that the ‘n’ of ‘na’ will be
matched by the ‘n’ within ‘ñ’.
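
To make that concrete, here is a minimal sketch in Python.  It assumes
that matching under canonical equivalence can be approximated by
normalising both strings to NFD and doing a plain substring test; a real
implementation has more cases to handle, and the helper names are purely
illustrative.

```python
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def find_canonical(needle, haystack):
    """Hypothetical helper: substring test on the NFD forms of both strings."""
    return nfd(needle) in nfd(haystack)

precomposed = "a\u00f1o"   # 'año' using U+00F1
decomposed = "an\u0303o"   # 'año' using n + U+0303 COMBINING TILDE

print(find_canonical("n", precomposed))        # True: annoyance (c)
print(find_canonical("no", precomposed))       # False: the tilde intervenes
print(find_canonical("\u00f1o", decomposed))   # True: one codepoint or two
```

In NFD the text is 'a', 'n', U+0303, 'o', so the combining tilde sits
between the 'n' and the following letter and the two-letter search
string cannot match across it.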

> > (This doesn't apply to diacritic-ignoring folding.)  
> 
> But the issue _was_ diacritic-ignoring folding.

Then we don't seem to have any evidence of user discontent arising from
supporting canonical equivalence.

> > That argument doesn't work with the Polish letter 'ń' though, as it
> > can be word-final.  

> It actually doesn't work in general, and one factor is indeed
> different languages.  The problem with ñ was raised by
> Spanish-speaking users, and only they were very much against folding
> in this case.

I'm not talking about folding.  I'm talking about canonical
equivalence, which largely but not solely consists of treating
precomposed characters as the same as their *canonical* decompositions. 
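
As a rough illustration of the distinction, the following Python sketch
compares strings by their NFD forms.  The function name is my own; the
only library calls are to the standard unicodedata module.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed character versus its canonical decomposition.
print(canonically_equivalent("\u00f1", "n\u0303"))               # True

# The "not solely" part: combining marks of different combining
# classes (dot below, dot above) may be typed in either order.
print(canonically_equivalent("a\u0323\u0307", "a\u0307\u0323"))  # True

# Diacritic-ignoring folding is a different thing altogether.
print(canonically_equivalent("\u00f1", "n"))                     # False

# U+FB01 LATIN SMALL LIGATURE FI has only a compatibility decomposition.
print(canonically_equivalent("\ufb01", "fi"))                    # False
```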

> > In many cases, the answer might be a search by collation graphemes,
> > but that has other issues besides language sensitivity.  

> It is also unworkable, because search has to work in contexts where
> the text is not displayed at all, and graphemes only exist at display
> time.

The definition of a collation grapheme cluster is given in Section 9.9
of UTS#10, which is currently at Version 12.1.0.  It is only connected
to display at a deep level, so display time is irrelevant.  Formally, it
depends on a collation, though the sorting aspect is irrelevant and is
removed for many 'search' collations in the CLDR.

So, if one were using a Spanish collation, on typing 'n' into the
incremental search string (and having it committed), the search wouldn't
consider a match with 'ñ'. Then, on further typing the combining tilde,
it would reject the matches it had found and choose those matches with
'ñ', whether one codepoint or two.  Would that behaviour cause serious
grief for incremental search?  As I use an XSAMPA-based input
implemented in quail that attempts to generate text in form NFC, I would
type 'n~' to get the Spanish character, and so would never get an
intermediate state where the incremental search was searching for 'n'.
(At least, not in Emacs 25.3.1.)
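
The behaviour I have in mind can be sketched as below.  This is only an
approximation: it treats a base character plus any following combining
marks as one unit, which is a crude stand-in for collation graphemes and
ignores tailoring-dependent contractions.  The function names are
hypothetical.

```python
import unicodedata

def units(text):
    """Split the NFD form of text into base-plus-combining-marks units."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new unit
    return out

def find_units(needle, haystack):
    """Return the unit indices where needle matches whole units of haystack."""
    n, h = units(needle), units(haystack)
    return [i for i in range(len(h) - len(n) + 1) if h[i:i + len(n)] == n]

text = "a\u00f1o nuevo"

# After typing 'n': only the plain 'n' of 'nuevo' matches.
print(find_units("n", text))          # [4]

# After further typing the combining tilde: the earlier match is
# rejected and the 'ñ' is found instead.
print(find_units("n\u0303", text))    # [1]
```

The same result is obtained whether the 'ñ' in the text or in the search
string is one codepoint or two, since both are normalised first.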

Richard.


