Character folding in text editors

Philippe Verdy verdy_p at wanadoo.fr
Sat Feb 20 11:27:41 CST 2016


Unless we have case folding tailored by language, you cannot do that based
on the Unicode database alone.
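
To make that limit concrete, here is a minimal Python sketch of the folding
you can derive from the Unicode Character Database by itself: decompose to
NFD, then drop the combining marks. The function name fold_with_ucd is
hypothetical, chosen only for this illustration; it is not what Emacs does.

```python
import unicodedata


def fold_with_ucd(text: str) -> str:
    """Fold using only UCD data: NFD-decompose, then drop combining marks.

    Hypothetical helper for illustration; it knows nothing about languages.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero combining class for marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for ch in "oöøónñ":
        print(ch, "->", fold_with_ucd(ch))
```

With this approach ö and ó fold to o and ñ folds to n, whatever the
language of the user, while ø is left unchanged because it has no canonical
decomposition. That is both too much for a Spanish or Swedish user and too
little for an English one.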

However, CLDR provides language-tailored collation data.

From my point of view, it is just a matter of selecting the collation
strength to use for collation-based searches. All collations in CLDR are
locale-dependent: the search algorithm must either use a preselected
language, detect the default language of the document, honor a language set
explicitly on specific fragments of the document, or use hints to guess the
effective language. CLDR also defines a "root" locale for use in
language-neutral contexts, or when the language cannot be determined from
the document or its metadata.
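
A toy Python sketch of what searching at primary collation strength with a
locale tailoring could look like. The TAILORINGS table is hand-written for
illustration and only mirrors the expectations described in the quoted
message below; it is NOT CLDR data, and the names primary_key and
matching_chars are hypothetical. A real implementation would take its
rules from CLDR, for example through a collation library such as ICU.

```python
import unicodedata

# Hand-written, illustrative tailorings; NOT taken from CLDR.
#   "distinct": letters the locale treats as letters of their own
#   "variants": letters the locale treats as a variant of another letter
TAILORINGS = {
    "en": {"distinct": set(), "variants": {"ø": "o"}},
    "es": {"distinct": {"ñ"}, "variants": {"ø": "o"}},
    "sv": {"distinct": {"å", "ä", "ö"}, "variants": {"ø": "ö"}},
}


def primary_key(text: str, locale: str) -> str:
    """Reduce text to a crude 'primary strength' key for the given locale."""
    rules = TAILORINGS[locale]
    key = []
    # NFC first, so decomposed input such as n + U+0303 is seen as ñ
    for ch in unicodedata.normalize("NFC", text.lower()):
        ch = rules["variants"].get(ch, ch)
        if ch not in rules["distinct"]:
            # Not a separate letter in this locale: ignore its accents
            ch = "".join(
                c
                for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
        key.append(ch)
    return "".join(key)


def matching_chars(needle: str, buffer: str, locale: str) -> list:
    """Return the characters of buffer that match needle at primary strength."""
    target = primary_key(needle, locale)
    return [
        ch
        for ch in unicodedata.normalize("NFC", buffer)
        if primary_key(ch, locale) == target
    ]


if __name__ == "__main__":
    sample = "o ö ø ó n ñ"
    for locale in ("en", "es", "sv"):
        print(locale, "o:", matching_chars("o", sample, locale))
        print(locale, "n:", matching_chars("n", sample, locale))
```

Under these assumed tailorings, a search for "o" in "sv" matches only o and
ó, while a search for "ö" matches ö and ø; in "es" a search for "n" no
longer matches ñ. Raising the strength to secondary or tertiary would
simply mean keeping more of the differences in the key.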



2016-02-20 11:23 GMT+01:00 Elias Mårtenson <lokedhs at gmail.com>:

> Hello Unicode,
>
> I have been involved in a rather long discussion on the Emacs-devel
> mailing list[1] concerning the right way to do character folding and we've
> reached a point where input from Unicode experts would be welcome.
>
> The problem is the implementation of equivalence when searching for
> characters. For example, if I have a buffer containing the following
> characters (both using the precomposed and canonical forms):
>
>     o ö ø ó n ñ
>
> The character folding feature in Emacs allows a search for "o" to match
> some or even all of these characters. The discussion on the mailing list
> has circulated around both the fact that the correct behaviour here is
> locale-dependent, and also on the correct way to implement this matching
> absent any locale-specific exceptions.
>
> An English speaker would probably expect a search for "o" to match the
> first 4 characters and a search for "n" to match the latter two.
>
> A Spanish speaker would expect that n and ñ be different but otherwise
> have the same behaviour as the English user.
>
> A Swedish user would definitely expect o and ö to compare differently, but
> ö and ø to compare the same.
>
> I have been reading the materials on unicode.org trying to see if this
> has been specifically addressed anywhere by the Unicode Consortium, but my
> results are inconclusive at best.
>
> What is the "correct" way to do this from Unicode's perspective? There is
> clearly an aspect of locale-dependence here, but how far can the Unicode
> data help?
>
> In particular, as far as I can see there is no way that the Unicode charts
> can allow me to write an algorithm where o and ø are seen as similar (as
> would be expected by an English user).
>
> [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html
>
>