Character folding in text editors

Eli Zaretskii eliz at
Sun Feb 21 10:28:05 CST 2016

> From: Philippe Verdy <verdy_p at>
> Date: Sun, 21 Feb 2016 00:19:19 +0100
> Cc: unicode Unicode Discussion <unicode at>
>  Unless we have case folding tailored by language, you cannot do that based
> on the Unicode database alone.
> However CLDR provides tailored data about collation.
> From my point of view, it is just a matter of selecting the collation
> strength to use for searches using collation. All collations in CLDR are
> locale-dependent (the search algorithm must be using either a language
> preselection, or detect the default language used by the document, or set
> explicitly in specific fragments of the document, or use some hints to
> guess what could be the effective language), even if CLDR also defines a
> "root" locale for use in language-neutral contexts, or when the language
> cannot be determined from the document or its metadata.

Emacs doesn't (yet) have the notion of the "current language".  Because
Emacs is a multi-lingual environment, where different languages are
routinely mixed in the same editing buffer, this is a hard problem that
doesn't yet have a solution.  Emacs does know the "charset" which the given
text came from, if the original was encoded in some telltale encoding,
like iso-2022-jp; it can also know the script of the text (by looking
at the Unicode block of the characters).  In some cases, this is
enough to deduce the language.  But in general, and notably with
languages that use the Latin script, this is not enough.  Using the
locale in which Emacs was started is insufficient in this age of
global communications.
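
To illustrate why the script alone is not enough, here is a rough Python
sketch (not Emacs code).  Python's standard library exposes neither the
Unicode block nor the script property, so the hypothetical helper
`guess_scripts` uses the first word of each character's Unicode name as a
stand-in for the script; that is an assumption of this sketch, not how
Emacs determines the script.

```python
import unicodedata

def guess_scripts(text):
    """Return the set of (approximate) scripts used by the letters in text.

    Assumption: the first word of a character's Unicode name is used as a
    crude proxy for its script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

# Cyrillic or kana narrow the language down considerably ...
print(guess_scripts("привет"))
print(guess_scripts("こんにちは"))
# ... but English, French and German words all look the same: Latin.
print(guess_scripts("hello bonjour Straße"))
```

The last call returns a single script for text in three languages, which
is the Latin-script problem described above.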

Therefore, the feature currently implemented for what will become
Emacs 25.1 in a few months was deliberately limited in scope to begin
with: it supports only "language-independent" folding.  In a nutshell,
this means ignoring all the collating weights except the primary.

The implementation basically uses the decomposition data in
UnicodeData.txt.  How different is that from the "root locale" data
that is part of CLDR?  What are the differences?  Does the
implementation based on decomposition data have any merit, or is it
completely useless/wrong?
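
For concreteness, here is a minimal Python sketch of what
decomposition-based folding amounts to.  It is not the Emacs Lisp
implementation: the names `fold` and `fold_search` are hypothetical, and
the case folding step is an extra assumption added so that the search
example is case-insensitive.

```python
import unicodedata

def fold(text):
    """Approximate language-independent folding via decomposition data."""
    # NFKD applies the canonical and compatibility decomposition mappings
    # that are recorded in UnicodeData.txt.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks, i.e. characters with a nonzero canonical
    # combining class, keeping only the base characters.
    base = "".join(c for c in decomposed if unicodedata.combining(c) == 0)
    # Assumption of this sketch: also ignore case differences.
    return base.casefold()

def fold_search(needle, haystack):
    """Return True if needle occurs in haystack after folding both."""
    return fold(needle) in fold(haystack)

print(fold("Résumé"))                       # accents removed
print(fold_search("resume", "My Résumé"))   # matches after folding
print(fold("ø") == "o")                     # False: no decomposition for ø
```

Characters such as ø (U+00F8) have no decomposition mapping in
UnicodeData.txt, so this approach leaves them unchanged.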
