Normalise Tai Tham or not?
Richard Wordingham via Unicode
unicode at unicode.org
Wed Oct 11 16:01:32 CDT 2017
On Wed, 11 Oct 2017 13:10:26 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> > Date: Tue, 10 Oct 2017 21:51:55 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> > > Emacs lately introduced character-folding in searches, but it's
> > > turned off by default, as many users objected.
> > I don't see how that helps with this problem. If I search for the
> > Northern Thai word /kin/ with the low tone, which means 'smell', I
> > want to find it whichever way round SAKOT and TONE-1 are, and I
> > don't want to find /kin/ with the rising tone, which is implied by
> > having no tone mark and means 'to eat'.
> That's what this feature is supposed to allow, see char-fold.el in the
> Emacs sources.
I downloaded Emacs 25.3.1, and set variable search-default-mode to
"char-fold search". Then, as intended, an incremental search for the
one character string <U+00E1 LATIN SMALL LETTER A WITH ACUTE> found
the string <U+0061, U+0301 COMBINING ACUTE ACCENT>.
The description I had found undersold the noble intention. Having
looked at the code, I can see why it should handle the problem of the
search string and the searched text being normalised differently - in
my example, an NFC search string being used on NFD text.
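As a rough illustration of the behaviour wanted here, the following
Python sketch brings both the search string and the text to NFD before
comparing them. It is not a description of how char-fold.el works; the
helper name canonical_contains is made up for this example, and it
relies only on the standard unicodedata module.

```python
import unicodedata

def canonical_contains(needle, haystack):
    """Return True if needle occurs in haystack once both are in NFD.

    Hypothetical helper for illustration only. A plain substring test
    on NFD text can also match part of a combining sequence (a bare
    'a' will be found inside 'a' + COMBINING ACUTE ACCENT).
    """
    nfd_needle = unicodedata.normalize("NFD", needle)
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    return nfd_needle in nfd_haystack

nfc_search = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE
nfd_text = "a\u0301"     # LATIN SMALL LETTER A, COMBINING ACUTE ACCENT

print(nfc_search in nfd_text)                     # code point comparison
print(canonical_contains(nfc_search, nfd_text))   # comparison after NFD
```

The first test compares code points directly and so misses the match;
the second finds it because both strings have been normalised.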
Unfortunately, it doesn't work in general with unnormalised text.
The NFC and NFD sequence กุ่ <U+0E01 THAI CHARACTER KO KAI, U+0E38 THAI
CHARACTER SARA U, U+0E48 THAI CHARACTER MAI EK> is canonically
equivalent to <U+0E01, U+0E48, U+0E38>, but the pair provides an example
of the failure to match, in both directions.
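The equivalence of the two Thai sequences can be checked with a short
Python sketch using the standard unicodedata module. The function name
canonically_equivalent is an invention for this example; the combining
classes it prints come from whatever version of the Unicode Character
Database the Python build ships with.

```python
import unicodedata

normalised = "\u0e01\u0e38\u0e48"    # KO KAI, SARA U, MAI EK
unnormalised = "\u0e01\u0e48\u0e38"  # KO KAI, MAI EK, SARA U

def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalisation."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Canonical reordering sorts adjacent marks by combining class, so the
# mark with the lower class (SARA U) ends up before MAI EK.
print(unicodedata.combining("\u0e38"), unicodedata.combining("\u0e48"))

print(normalised == unnormalised)                        # raw comparison
print(canonically_equivalent(normalised, unnormalised))  # after NFD
print(unicodedata.normalize("NFC", unnormalised) == normalised)
```

A search that compares raw code points treats the two spellings as
different strings, which is the failure described above; normalising
both sides first makes them compare equal.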
Thai computing originally dealt with the problem by setting up input
rules which prevent one from entering what is now the unnormalised form.
The email client I use won't let me type in the unnormalised form -
text is converted to NFC on input, both as email text and in search
strings. (Latin text and Tai Tham also get normalised on input - this
is not special treatment for Thai.) Emacs seems to deal with the issue
for Thai by misrendering the unnormalised form.
Compulsive normalisers do strengthen the argument for the spell checker
standardising on the normalised form.
Incidentally, another example of an editor that won't match canonically
equivalent strings is Word in Microsoft Office Standard 2010 - I tried
it with the Tai Tham pair.