Character folding in text editors

Elias Mårtenson lokedhs at gmail.com
Sat Feb 20 04:23:13 CST 2016


Hello Unicode,

I have been involved in a rather long discussion on the Emacs-devel mailing
list[1] concerning the right way to do character folding and we've reached
a point where input from Unicode experts would be welcome.

The problem is how to implement character equivalence when searching.
For example, suppose I have a buffer containing the following characters
(in both their precomposed and canonically decomposed forms):

    o ö ø ó n ñ

The character folding feature in Emacs allows a search for "o" to match some
or even all of these characters. The discussion on the mailing list has
revolved around two points: that the correct behaviour here is
locale-dependent, and how to implement this matching correctly in the
absence of any locale-specific exceptions.

An English speaker would probably expect a search for "o" to match the
first four characters and a search for "n" to match the last two.

A Spanish speaker would expect n and ñ to be treated as different, but
would otherwise expect the same behaviour as the English user.

A Swedish user would definitely expect o and ö to compare differently, but
ö and ø to compare the same.
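
As a rough illustration of the three expectations above, here is a minimal
Python sketch. It assumes the default folding is "decompose canonically and
drop combining marks", with a hand-written per-locale override table on top.
The names LOCALE_OVERRIDES, fold_char and fold are hypothetical, and the
table entries come only from the examples in this message, not from any
Unicode data file.

```python
import unicodedata

# Hypothetical per-locale overrides, hand-written from the expectations
# described above; they are not taken from any Unicode data file.
LOCALE_OVERRIDES = {
    "en": {"ø": "o"},            # English: ø is treated like o
    "es": {"ø": "o", "ñ": "ñ"},  # Spanish: as English, but ñ stays distinct
    "sv": {"ö": "ö", "ø": "ö"},  # Swedish: ö stays distinct, ø matches ö
}

def fold_char(ch, overrides):
    """Fold one character: locale override first, then strip combining marks."""
    if ch in overrides:
        return overrides[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold(text, locale):
    """Return a search key for text under the given (assumed) locale rules."""
    overrides = LOCALE_OVERRIDES.get(locale, {})
    # Compose first, so decomposed input such as "o" + U+0308 is seen as "ö".
    composed = unicodedata.normalize("NFC", text)
    return "".join(fold_char(ch, overrides) for ch in composed)

sample = "o ö ø ó n ñ"
print(fold(sample, "en"))  # o o o o n n
print(fold(sample, "es"))  # o o o o n ñ
print(fold(sample, "sv"))  # o ö ö o n n
```

Two strings would then match when their folded keys are equal. The sketch
ignores case and covers only the characters mentioned here.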

I have been reading the materials on unicode.org trying to see if this has
been specifically addressed anywhere by the Unicode Consortium, but my
results are inconclusive at best.

What is the "correct" way to do this from Unicode's perspective? There is
clearly an aspect of locale-dependence here, but how far can the Unicode
data help?

In particular, as far as I can see, nothing in the Unicode charts would let
me write an algorithm that treats o and ø as similar (as would be expected
by an English user).
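
The following short Python sketch shows where the difficulty comes from. It
uses the standard unicodedata module, and base_letter is a hypothetical
helper name. Stripping combining marks after canonical decomposition works
for ö, ó and ñ, but ø (U+00F8) has no decomposition mapping in the Unicode
Character Database, so this approach leaves it unchanged.

```python
import unicodedata

def base_letter(ch):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

for ch in "öóñø":
    # decomposition() returns the raw mapping from the character database,
    # or an empty string when the character has none.
    print(ch, repr(unicodedata.decomposition(ch)), base_letter(ch))
```

For ø the decomposition string is empty and base_letter returns ø itself,
so any o/ø equivalence has to come from some other source of data.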

[1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html