Character folding in text editors

Doug Ewell doug at ewellic.org
Sun Feb 21 11:53:23 CST 2016


Eli Zaretskii wrote:

>> About the closest approximation you can get using Unicode data alone
>> (not CLDR) is to normalize to NFD, then ignore the combining
>> diacritics.
>
> This is what Emacs currently does, IIUC what you say.  The NFD
> normalization uses the decomposition data included with
> UnicodeData.txt.  Is this what you mean?

Yes, the sixth semicolon-separated field from the left (the decomposition
mapping). For 00F1 this is 006E 0303, so you ignore the 0303 and fold
00F1 to 006E.
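
As a rough illustration of reading that field, here is a minimal Python
sketch. The function name parse_decomposition and the SAMPLE constant are
made up for this example, and SAMPLE is meant to be the 00F1 record as it
appears in UnicodeData.txt. A leading tag such as <compat> marks a
compatibility mapping, which is not part of NFD, so the sketch returns it
separately.

```python
def parse_decomposition(record):
    """Return (tag, code_points) from the sixth field of one record.

    tag is None for a canonical mapping, or a string such as "<compat>"
    for a compatibility mapping. code_points is a list of integers and
    is empty when the character has no decomposition at all.
    """
    fields = record.split(";")
    decomp = fields[5]          # sixth field from the left
    if not decomp:
        return None, []
    parts = decomp.split()
    tag = None
    if parts[0].startswith("<"):
        tag = parts[0]
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


# Sample record for U+00F1 (illustrative input for this sketch).
SAMPLE = ("00F1;LATIN SMALL LETTER N WITH TILDE;Ll;0;L;006E 0303;;;;N;"
          "LATIN SMALL LETTER N TILDE;;00D1;;00D1")

tag, code_points = parse_decomposition(SAMPLE)
print(tag, [f"{cp:04X}" for cp in code_points])
```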

Remember that the decompositions in UnicodeData.txt may contain other 
precomposed characters, so you have to apply this process iteratively:

1EA8 -> 00C2 0309
00C2 -> 0041 0302
so you fold 1EA8 to 0041.
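
The iterative process can be sketched in Python as follows. This is only
an approximation under stated assumptions, not what Emacs or any other
editor actually does: it uses the standard unicodedata module rather than
parsing UnicodeData.txt directly, the names decompose and fold are
invented for this example, and "combining diacritic" is taken to mean any
character whose general category starts with M. Hangul syllables, which
decompose algorithmically rather than through the data file, are left
untouched.

```python
import unicodedata


def decompose(ch):
    """Expand one character's canonical decomposition recursively."""
    mapping = unicodedata.decomposition(ch)
    # An empty string means no decomposition; a leading "<tag>" marks a
    # compatibility mapping, which NFD does not apply.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(decompose(chr(int(cp, 16))) for cp in mapping.split())


def fold(text):
    """Decompose every character, then drop all combining marks."""
    return "".join(
        c
        for ch in text
        for c in decompose(ch)
        if not unicodedata.category(c).startswith("M")
    )


print(fold("\u1EA8"))   # 1EA8 -> 00C2 0309 -> 0041 0302 0309 -> "A"
print(fold("\u00F1"))   # 00F1 -> 006E 0303 -> "n"
```

Note that fold("\u00F8") returns the character unchanged, because 00F8 has
no decomposition mapping in the data.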

>> But that still doesn't work for a character like ø, which doesn't
>> decompose to o + anything
>
> Why doesn't it, btw?  Same question about ł.
>
> I've heard an opinion that UnicodeData.txt only included
> decompositions when the combining mark's glyphs don't overlap those of
> the basic character.  Is that correct?

This sounds like a great question for Ken Whistler. ☺

>> and more importantly, it still won't meet expectations because of
>> the n/ñ and o/ö/ø language-dependency problems.
>
> Given that the feature can be turned off easily, do you think that it
> will nonetheless be useful, even though language-dependent parts are
> not available?

It's probably a lot better than no folding. Just be prepared for the 
inevitable complaints from speakers of language X. Users tend to expect 
features like this to be perfect, even when you warn them.

--
Doug Ewell | http://ewellic.org | Thornton, CO


