L2/18-181

Richard Wordingham via Unicode unicode at unicode.org
Wed May 16 19:24:11 CDT 2018


On Wed, 16 May 2018 17:41:12 -0500
Anshuman Pandey via Unicode <unicode at unicode.org> wrote:

> > 3. Keyboard design is more difficult because consonants like ক্ষ
> > are encoded as conjunct forms instead of atomic characters.  
> 
> Ignorant question on my part: is it difficult to use character
> sequences as labels for keys? I see keys for both क्ष and ज्ञ on the
> iOS Hindi keyboard, and त्र is tucked away under त.

It can be.  It depends on the technology.

Pure X seems to be the worst.  At the basic level, one has a
bewildering map of key plus active modifier key to a single
Unicode character.  (The space also includes function keys.)  An
*application* can map keys to strings, but I know of no way of doing
that to all of a user's applications, both those running and those that
will run.  Even the logic for dead keys has to be applied by the
application, though I believe there are standard libraries that will
handle that.
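
As a rough illustration of that split, here is a toy model in Python.
It is only a sketch, not the X API: the table names, the key labels and
the modifier names are all made up for the example.  The shared table
can hold only one character per entry, while the string mapping lives
in a table that a single application owns.

```python
# Toy model only: none of these names correspond to real X interfaces.

# Shared, system-wide table: (key, modifiers) -> exactly one character.
SYSTEM_KEYMAP = {
    ("k", frozenset()): "\u0915",           # DEVANAGARI LETTER KA
    ("k", frozenset({"Shift"})): "\u0916",  # DEVANAGARI LETTER KHA
}

# Per-application table: (key, modifiers) -> arbitrary string.
# Only the application that owns this table benefits from it.
APP_KEYMAP = {
    ("x", frozenset({"AltGr"})): "\u0915\u094D\u0937",  # KA + VIRAMA + SSA
}


def check_system_keymap(table):
    """Reject any entry that is not exactly one character."""
    for combo, output in table.items():
        if len(output) != 1:
            raise ValueError(f"{combo!r} maps to {len(output)} characters")


def lookup(key, modifiers, app_keymap=None):
    """Resolve a key press, letting an application table take priority."""
    combo = (key, frozenset(modifiers))
    if app_keymap is not None and combo in app_keymap:
        return app_keymap[combo]
    return SYSTEM_KEYMAP.get(combo, "")


check_system_keymap(SYSTEM_KEYMAP)
print(lookup("k", []))                    # one character from the shared table
print(lookup("x", ["AltGr"], APP_KEYMAP)) # a string, but only in this application
print(repr(lookup("x", ["AltGr"])))       # other applications get nothing
```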

The old method on Windows uses sets of data tables that may be termed
keyboards.  Populated sets are saved as DLLs, and there are limits on
what they can contain.  Windows' Microsoft Keyboard Layout Creator
(MSKLC) is a popular tool for creating and packaging these DLLs.
A key plus its modifiers can be mapped to:

1) A sequence of UTF-16 code units.  The documented limit is, I believe,
4, but there are reports of people being able to use 6.  The four
sequences listed above each constitute a sequence of 3 code units, so
they can be readily accommodated.  This technique may well not work
for a script in the SMP, and I think one cannot use the MSKLC simply to
create the DLLs storing long sequences.  So here is an added layer of
complexity, though not relevant to the Bengali script.

2) A key can be designated a 'dead key'.  I think it has to have a
fallback to a BMP character, or rather, a single UTF-16 code unit.
When one then presses a key that maps to a single code unit, that code
unit is converted to another single code unit, which is the character
that the combination types.  The restriction is built into the data
structure.
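
The shape of the dead-key restriction in item 2 can be sketched in
Python.  This is a toy model, not the actual Windows data structure:
the table, the function names and the fallback behaviour (emit the dead
key's own character followed by the base) are assumptions made for
illustration.

```python
# Hypothetical dead-key table: (dead-key unit, base unit) -> one code unit.
DEAD_KEY_TABLE = {
    ("\u00B4", "e"): "\u00E9",  # ACUTE ACCENT + e -> e with acute
    ("\u00B4", "a"): "\u00E1",  # ACUTE ACCENT + a -> a with acute
}


def is_single_code_unit(text):
    """True if text encodes to exactly one UTF-16 code unit."""
    return len(text.encode("utf-16-le")) == 2


def check_table(table):
    """Every input and output must fit in a single UTF-16 code unit."""
    for (dead, base), result in table.items():
        for item in (dead, base, result):
            if not is_single_code_unit(item):
                raise ValueError(f"{item!r} is not a single code unit")


def apply_dead_key(dead, base):
    """Combine a dead key with the next key press."""
    result = DEAD_KEY_TABLE.get((dead, base))
    if result is None:
        # Assumed fallback: the dead key's own character, then the base.
        return dead + base
    return result


check_table(DEAD_KEY_TABLE)
print(apply_dead_key("\u00B4", "e"))
print(apply_dead_key("\u00B4", "x"))

# A conjunct such as KA + VIRAMA + SSA cannot be the result of an entry.
print(is_single_code_unit("\u0995\u09CD\u09B7"))
```

In this model a conjunct can never be the output of a dead-key
combination, because it does not fit in one code unit.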

There is a technique to chain dead keys, but that is not relevant to
the difficulty or ease of typing ligatures.
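
What does matter for ligatures is the code-unit count from item 1.  The
arithmetic can be checked with a short Python sketch.  The limit of 4
is the figure I believe is documented, not one I have verified, and the
three supplementary-plane code points are simply picked from the Brahmi
block to show the effect of surrogate pairs.

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


LIMIT = 4  # assumed limit, per "I believe 4" above

# The four conjuncts from the quoted message: consonant + virama + consonant.
conjuncts = [
    "\u0995\u09CD\u09B7",  # Bengali KA + VIRAMA + SSA
    "\u0915\u094D\u0937",  # Devanagari KA + VIRAMA + SSA
    "\u091C\u094D\u091E",  # Devanagari JA + VIRAMA + NYA
    "\u0924\u094D\u0930",  # Devanagari TA + VIRAMA + RA
]

for seq in conjuncts:
    points = " ".join(f"U+{ord(ch):04X}" for ch in seq)
    print(points, utf16_units(seq), utf16_units(seq) <= LIMIT)

# Three supplementary-plane code points need two code units each,
# so a similar three-character sequence takes 6 units and exceeds 4.
smp_sequence = "\U00011013\U00011046\U00011031"
print(utf16_units(smp_sequence), utf16_units(smp_sequence) <= LIMIT)
```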

The next level up I am acquainted with is the level of input methods.
Here, one types a sequence of characters on a 'simple' keyboard, and
this sequence controls the derivation of characters being input to the
application.  Modifier keys may be available to influence this
derivation.  Now, some of these input methods may be unreliable, and
there may be problems for users who can switch between simple
keyboards, e.g. US and British, or US and Hindi.
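
A minimal sketch of such a derivation, in Python, may make the idea
concrete.  The rule table is invented for the example and is not taken
from any real input method.  Real input methods work incrementally as
keys arrive, whereas this processes a whole string at once using
greedy longest-match.

```python
# Hypothetical rules: typed key sequence -> characters delivered.
RULES = {
    "k": "\u0915",               # KA
    "kS": "\u0915\u094D\u0937",  # KA + VIRAMA + SSA
    "t": "\u0924",               # TA
    "tr": "\u0924\u094D\u0930",  # TA + VIRAMA + RA
}


def derive(keys):
    """Convert typed keys to output text, preferring the longest rule."""
    longest = max(len(rule) for rule in RULES)
    out = []
    i = 0
    while i < len(keys):
        for n in range(min(longest, len(keys) - i), 0, -1):
            chunk = keys[i:i + n]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += n
                break
        else:
            # No rule applies: pass the key through unchanged.
            out.append(keys[i])
            i += 1
    return "".join(out)


print(derive("kStr"))  # two conjuncts from four key presses
print(derive("kt"))    # two plain consonants
```

Note that the result depends on which characters the 'simple' keyboard
delivers, which is one reason switching between, say, US and British
layouts can upset such a method.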

If this type of method works, then inputting sequences in response to
a single keystroke is not a problem.  Multiple key strokes can be a
different matter, as the interface with applications may be ill-defined
or broken.  I have found this a problem with using the backslash key to
cycle through candidate characters, and deleting SMP characters in
LibreOffice has in the past resulted in the creation of lone surrogates.
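
The following Python sketch shows one plausible way a lone surrogate
can arise.  I am not claiming this is what LibreOffice actually did;
it merely models a backspace that removes one UTF-16 code unit rather
than one character.  The function names are made up for the example.

```python
import struct


def to_utf16_units(text):
    """Return text as a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def naive_backspace(units):
    """Delete the last code unit, ignoring surrogate pairs."""
    return units[:-1]


def safe_backspace(units):
    """Delete the last character, removing both halves of a pair."""
    if len(units) >= 2 and is_low(units[-1]) and is_high(units[-2]):
        return units[:-2]
    return units[:-1]


def has_lone_surrogate(units):
    """True if any surrogate code unit is not part of a valid pair."""
    i = 0
    while i < len(units):
        if is_high(units[i]):
            if i + 1 < len(units) and is_low(units[i + 1]):
                i += 2
                continue
            return True
        if is_low(units[i]):
            return True
        i += 1
    return False


units = to_utf16_units("a\U00011013")  # 'a' plus one SMP character
print([hex(u) for u in units])
print(has_lone_surrogate(naive_backspace(units)))  # True
print(has_lone_surrogate(safe_backspace(units)))   # False
```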

Now, writing these input methods can be easy.  I have fairly simple
input methods for inputting both true characters and sequences
perceived as characters for Emacs, ibus (using KMfL) and fcitx (using
M17n).  However, the ibus method has been unreliable in the past, and I
have fallen back to a simple X keyboard map.  When I do that, I lose
the ability to input sequences by a single keystroke.

Richard.


