Editing Sinhala and Similar Scripts

Marc Durdin marc at keyman.com
Sun Mar 23 17:46:49 CDT 2014

All the Keyman products (on Windows, web, iOS and Android, as well as KMFL, which is a port of Keyman) work on the principle of modifying the text buffer directly.  There is no intermediate compose buffer.  For Indic and western scripts this works pretty well; in my experience, the compose buffer that IMEs provide does not fit these scripts cleanly.  It is often hard to know when a text entry is 'complete' and the compose buffer can be committed, so the compose buffer tends to get very long, which makes accidental cancellation of input a common and frustrating issue.
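
To make the "modify the text buffer directly" idea concrete, here is a rough sketch in Python.  It is not Keyman's actual code or rule format: the RULES table and the press() and type_keys() functions are hypothetical names invented for this illustration.  The point is only that every keystroke lands in the real text immediately, and a later keystroke may rewrite the text just before the cursor, so nothing ever waits in a compose buffer.

```python
# Hypothetical rules: (text already in the buffer, key pressed) -> replacement.
# These are illustrative only and are not taken from any real keyboard layout.
RULES = {
    ("a", "^"): "\u00e2",       # a + ^ -> a with circumflex
    ("e", "^"): "\u00ea",       # e + ^ -> e with circumflex
    ("\u00e2", "'"): "\u1ea5",  # a-circumflex + ' -> a with circumflex and acute
}


def press(buffer: str, key: str) -> str:
    """Apply one keystroke, rewriting the end of the buffer if a rule matches."""
    for (context, trigger), output in RULES.items():
        if key == trigger and buffer.endswith(context):
            # Remove the matched context from the buffer and emit the replacement.
            return buffer[: len(buffer) - len(context)] + output
    # No rule matched: the key is simply appended as ordinary text.
    return buffer + key


def type_keys(keys: str, buffer: str = "") -> str:
    """Feed a sequence of keystrokes through press(), one at a time."""
    for key in keys:
        buffer = press(buffer, key)
    return buffer
```

Typing c, a, ^, ' with these rules leaves "c", "ca", "c" + a-circumflex and finally "c" + a-circumflex-acute in the buffer in turn; each intermediate state is ordinary committed text that the application can see.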

The most obvious backspace intelligence I've seen in use is around handling NFC vs NFD text.  It is confusing to the end user if backspace sometimes deletes a whole character + diacritic, and sometimes just the diacritic mark.  For example, Vietnamese text has suffered from this issue with the varying composition schemes we've seen enforced by limited input methods.
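
The NFC vs NFD difference can be shown with a short Python sketch using the standard unicodedata module.  The two backspace functions are hypothetical names for this illustration and do not describe how any particular editor or Keyman keyboard behaves; the second one is just one possible policy (remove the base character together with its combining marks).

```python
import unicodedata


def naive_backspace(text: str) -> str:
    """Delete the last code point, as a simple editor might."""
    return text[:-1]


def cluster_backspace(text: str) -> str:
    """Delete the last base character together with any trailing combining marks."""
    i = len(text)
    # Step back over combining marks (non-zero canonical combining class).
    while i > 0 and unicodedata.combining(text[i - 1]) != 0:
        i -= 1
    # Then drop the base character itself, if there is one.
    return text[: max(i - 1, 0)]


# "Vi" followed by U+1EC7 (e with circumflex and dot below), in both forms.
word_nfc = unicodedata.normalize("NFC", "Vi\u1ec7")  # 3 code points
word_nfd = unicodedata.normalize("NFD", "Vi\u1ec7")  # 5 code points: V i e U+0323 U+0302

after_nfc = naive_backspace(word_nfc)  # the whole vowel disappears
after_nfd = naive_backspace(word_nfd)  # only the circumflex disappears
```

With the naive version the same visible text loses the whole vowel in NFC but only the circumflex in NFD, which is the inconsistency the user sees.  cluster_backspace() gives "Vi" for both forms.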

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham
Sent: Monday, 24 March 2014 12:07 AM
To: unicode at unicode.org
Subject: Re: Editing Sinhala and Similar Scripts

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> This is wrong, the IME or keyboard driver handles the state of 
> keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does not 
> matter, and so it won't feed the encoded text with streams of 
> characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't believe it is true of Tavultesoft Keyman for Windows either.  This does require that the input method have a way of cancelling previously provided input.  Now, if you use a method with a COMPOSE key or a DEAD key, you are generally unlikely to get tentative entries.  However, one could write an input method that simulated a dead key but actually generated an output for it so as to imitate a typewriter differently.

> The effect of Backspace entered just after it would delete 
> simultaneously CGJ and the diacritic characters. It does not need to 
> depend on the input state of the driver or the IME. In all cases, 
> nothing in the keyboard mapping or IME will generate a CGJ character 
> in isolation, it will be always followed by something.

If backspace is not modified by the input method - and Marc Durdin has suggested that the input method should sometimes modify it - its effect will depend on the process controlling the backing store, which in general will work with multiple input methods, even during the course of a single editing session.  You might not write an input method that generates a single CGJ, but I do.  Do you insist on a soft hyphen when writing 'Llangollen' so that it will collate after 'Llanberis' in Welsh?  (I typed the place names in English; the names are spelt the same way in English and Welsh in hardcopy, though of course the letter counts differ.)
