Editing Sinhala and Similar Scripts

Philippe Verdy verdy_p at wanadoo.fr
Sat Mar 22 21:32:06 CDT 2014


2014-03-23 1:16 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sat, 22 Mar 2014 23:37:49 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > 2014-03-22 20:50 GMT+01:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
> >
> > > > But it won't apply to "diacritics" (combining characters or joiner
> > > > controls like CGJ, ZWJ and ZWNJ, and possibly even some other
> > > > format controls) that have combining class 0, because their
> > > > encoding order is significant, so you know where to stop the
> > > > effect of Backspace.
> > >
> > > Your approach recommends input methods that separate combining
> > > marks of different combining classes by CGJ for easier editing!
> > >
> >
> > NO. I certainly do not recommend it ! This is a false assertion.
>
> If one takes your approach to handling input, then one needs CGJ to ease
> the correction of diacritics.  I am not saying that you recommend the
> use of CGJ.
>
> > > > I see absolutely no reason why Backspace would arbitrarily delete
> > > > only the last encoded character when users cannot even count them
> > > > and may not have input them separately, or could expect them to
> > > > have been typed in a different order.
> > > >
> > > > So yes, entering:
> > > > <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> > > > <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> > > > <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> > > > <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> > > > should all result in keeping only the letter C in the backing
> > > > store.
> > >
> > > > And with an IME supporting a Compose key this will also be true:
> > >
> > > > <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> > > > <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> > > > <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> > > > <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
> > >
> > > Your input methods suggest that there is something unitary about the
> > > result - which makes sense if their output is U+1E08 LATIN CAPITAL
> > > LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments
> > > if 'C' were replaced with 'S'?  There is no character LATIN CAPITAL
> > > LETTER S WITH CEDILLA AND ACUTE.
> >
> > I have NOT said that there existed such character (look at the
> > separating commas).
>
> I looked at the names.  Dead keys are effectively modifiers applied
> beforehand rather than simultaneously, so there is no more reason for
> the dead key sequences to generate more than one character than there
> is for an ordinary key to generate multiple characters.
>
> The use of 'COMPOSE' indicates that one is not simply entering a
> sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
> an input process different to simply 'C, COMBINING CEDILLA, COMBINING
> ACUTE'.
>

Here again you reinterpret what I did not say. When I used DEADKEY or
COMPOSE, I was evidently referring to keystrokes, not characters. So I did
not imply any encoding of characters (I was clear enough in saying that
these sequences of keystrokes were allowed to generate any canonically
equivalent encoding); instead I described the input (on a keyboard or IME)
and the expected output (an encoded text that should be canonically
equivalent).
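
To make "canonically equivalent" concrete, here is a minimal Python sketch
using only the standard unicodedata module. The helper name
canonically_equivalent and the list of candidate outputs are illustrative
assumptions, not something any particular driver is known to emit.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Encodings a driver might plausibly emit for the keystroke sequences above.
outputs = [
    "C\u0327\u0301",  # C, COMBINING CEDILLA, COMBINING ACUTE ACCENT
    "C\u0301\u0327",  # C, COMBINING ACUTE ACCENT, COMBINING CEDILLA
    "\u00c7\u0301",   # C WITH CEDILLA, COMBINING ACUTE ACCENT
    "\u0106\u0327",   # C WITH ACUTE, COMBINING CEDILLA
    "\u1e08",         # C WITH CEDILLA AND ACUTE
]

all_equivalent = all(canonically_equivalent(outputs[0], s) for s in outputs)
print(all_equivalent)
```

Any of these encodings would satisfy the "expected output" described here,
since the comparison is made after normalization.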

I have NOWHERE intended to force the use of CGJ (you seem to imply that
these keys will generate separate combining diacritics/joiners, one or two,
for each key).

This is wrong: the IME or keyboard driver handles the state of keystrokes.
Whether you use a COMPOSE key or a DEAD KEY does not matter, so it won't
feed the encoded text with streams of characters as long as the state is
not complete enough.

In fact this input with a compose key does not work:
COMPOSE, C, CEDILLA, ACUTE
simply because the composed sequence is already terminated after the
cedilla modifier key. So when you type the acute modifier key, it is not
associated with the sequence. That's another reason why dead keys work:
the state is not complete as long as you have not *finally* input the base
letter. But let's suppose that the driver must generate something: then for
the ACUTE key it would need to output the combining character, possibly
with a preceding CGJ if the intent is to have the acute accent ordered
relative to the cedilla (this is very unusual).

In the vast majority of usages, diacritics never need a preceding CGJ to
preserve their relative ordering: the need almost never arises for
diacritics that have distinct non-zero combining classes. The rare cases do
occur, however, in classical pointed Hebrew.
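
As an illustration of the Hebrew case, here is a short Python sketch. It
assumes a LAMED carrying PATAH followed by HIRIQ as the example where the
typed order is to be kept; the variable names are mine. Normalization
reorders the two points by combining class unless a CGJ sits between them.

```python
import unicodedata

CGJ = "\u034f"                                 # COMBINING GRAPHEME JOINER, class 0
LAMED, PATAH, HIRIQ = "\u05dc", "\u05b7", "\u05b4"

typed = LAMED + PATAH + HIRIQ                  # order as entered
protected = LAMED + PATAH + CGJ + HIRIQ        # same, with CGJ blocking reordering

# Canonical combining classes: PATAH is 17, HIRIQ is 14, so HIRIQ sorts first.
classes = (unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))

reordered = unicodedata.normalize("NFC", typed)
preserved = unicodedata.normalize("NFC", protected)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in preserved])
```

For marks such as cedilla and acute the reordering is harmless, which is why
no CGJ is needed there.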

For this reason the keyboard driver will likely include a separate key
mapping for the CGJ, either
- as a base key entered after the diacritic dead key, to force the output
of CGJ+diacritic characters; or
- as a sequence with COMPOSE+diacritic key, without any key for the
intermediate base letter, to produce the same output.
In the first case (driver with dead keys), you need a single keyboard
mapping for the CGJ working as a dead key. In the second case (driver with
compose key), you use the COMPOSE key mapping only, but you still need to
map positions for the second base key (in the 3-key compose sequence) meant
to represent diacritics.

The effect of Backspace entered just after it would be to delete the CGJ
and the diacritic characters simultaneously. It does not need to depend on
the input state of the driver or the IME. In all cases, nothing in the
keyboard mapping or IME will generate a CGJ character in isolation; it will
always be followed by something.
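
One possible reading of this Backspace behaviour, sketched in Python. The
function name backspace and the exact stopping rule are assumptions: it
removes all trailing marks with a non-zero combining class in one step,
takes an introducing CGJ with them, and otherwise deletes one code point. As
a simplification it normalizes the whole string rather than only the last
cluster.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def backspace(text: str) -> str:
    """Delete the trailing run of reorderable marks, or else one code point."""
    nfd = unicodedata.normalize("NFD", text)
    end = len(nfd)
    # Skip back over marks whose relative order is not significant.
    while end > 0 and unicodedata.combining(nfd[end - 1]) != 0:
        end -= 1
    if end < len(nfd):
        # Marks were removed; a CGJ that introduced them goes with them.
        if end > 0 and nfd[end - 1] == CGJ:
            end -= 1
    elif end > 0:
        # No trailing marks: plain one-code-point deletion.
        end -= 1
    return unicodedata.normalize("NFC", nfd[:end])

for sample in ("\u1e08", "C\u0301\u0327", "C\u034f\u0301", "C\u0301\u034f\u0327"):
    print([f"U+{ord(c):04X}" for c in backspace(sample)])
```

With this rule the result does not depend on the order in which the
diacritics were typed or encoded.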

But what would happen if you typed the compose sequence generating CGJ
with COMPOSE but forgot to press the initial base letter, or typed COMPOSE
after the base letter? With
  C, COMPOSE, ACUTE
you get the characters <C, CGJ, combining ACUTE>. You cannot type another
CEDILLA after it without pressing COMPOSE again before it, which gives
<C, CGJ, combining ACUTE, CGJ, combining CEDILLA>.
The result clearly abuses CGJ, when the output should just be canonically
equivalent to
<C, combining ACUTE, combining CEDILLA> (i.e. without any CGJ at all).
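
A quick check of that claim in Python, using only the standard unicodedata
module (the variable names are illustrative): the CGJ-laden sequence
survives normalization unchanged and is not canonically equivalent to the
plain one.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER
ACUTE, CEDILLA = "\u0301", "\u0327"

with_cgj = "C" + CGJ + ACUTE + CGJ + CEDILLA   # result of the stray COMPOSE presses
without_cgj = "C" + ACUTE + CEDILLA            # what the user meant

nfc_with = unicodedata.normalize("NFC", with_cgj)
nfc_without = unicodedata.normalize("NFC", without_cgj)

same = (unicodedata.normalize("NFD", with_cgj)
        == unicodedata.normalize("NFD", without_cgj))

print(len(nfc_with), len(nfc_without), same)
```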

Your system would be even less meaningful: it would break in most renderers
and spell checkers. It would break in IDNA domain names. It would not match
in plain-text searches unless they are tuned so that their collators
discard the CGJs to look for fuzzy matches (fuzzy matches would also look
for strings that are compatibility equivalent under NFKD, or could search
at collation level 2, or at collation level 1, ignoring all diacritics and
CGJ wherever they are).
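
The search problem can be sketched as follows in Python. This is only a
crude stand-in for a real collator: the helpers fold and fuzzy_find are
hypothetical, and dropping all Mn characters merely approximates a
level-1 comparison rather than implementing the Unicode Collation
Algorithm.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

def fold(text: str, ignore_diacritics: bool = False) -> str:
    """Discard CGJ, decompose, and optionally drop all nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text.replace(CGJ, ""))
    if ignore_diacritics:
        decomposed = "".join(
            ch for ch in decomposed if unicodedata.category(ch) != "Mn"
        )
    return decomposed

def fuzzy_find(haystack: str, needle: str, ignore_diacritics: bool = False) -> bool:
    """Substring test on folded text (may match partial clusters)."""
    return fold(needle, ignore_diacritics) in fold(haystack, ignore_diacritics)

document = "fa" + "c" + CGJ + "\u0301" + CGJ + "\u0327" + "ade"
needle = "\u1e09"  # LATIN SMALL LETTER C WITH CEDILLA AND ACUTE

print(needle in document)                      # plain code point search
print(fuzzy_find(document, needle))            # CGJ discarded
print(fuzzy_find(document, "facade", True))    # diacritics ignored as well
```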

So compose keys cause more confusion for native users than dead keys, which
are smarter: they can record more internal states, and they allow an
arbitrary order of input for unordered diacritics. With acute plus cedilla,
for example, you can press the dead keys in any order; the IME or driver
handles the case and generates them, preferably in canonical order with
increasing combining classes. The driver or IME also generates them in an
input state where it knows the base letter to output, so it can precombine
the diacritics and output C WITH CEDILLA followed by COMBINING ACUTE, as
expected, still without needing any CGJ.
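
A minimal sketch of such a dead-key driver in Python. The class
DeadKeyDriver and the key names "ACUTE" and "CEDILLA" are hypothetical, and
the sketch emits NFC, which is one of several canonically equivalent outputs
the driver could choose. It holds the pending diacritics until the base
letter arrives.

```python
import unicodedata

# Hypothetical dead-key names mapped to the combining marks they stand for.
DEAD_KEYS = {"ACUTE": "\u0301", "CEDILLA": "\u0327"}

class DeadKeyDriver:
    """Buffers dead keys and emits text only once the base letter is typed."""

    def __init__(self) -> None:
        self.pending: list[str] = []   # marks waiting for a base letter
        self.output: list[str] = []    # clusters committed to the backing store

    def press(self, key: str) -> None:
        if key in DEAD_KEYS:
            self.pending.append(DEAD_KEYS[key])
            return
        # Base letter: normalization sorts the marks by combining class and
        # precombines what it can, so the typing order does not matter.
        cluster = unicodedata.normalize("NFC", key + "".join(self.pending))
        self.pending.clear()
        self.output.append(cluster)

    def text(self) -> str:
        return "".join(self.output)

def type_keys(*keys: str) -> str:
    driver = DeadKeyDriver()
    for key in keys:
        driver.press(key)
    return driver.text()

print(type_keys("CEDILLA", "ACUTE", "C") == type_keys("ACUTE", "CEDILLA", "C"))
```

No CGJ is ever produced, and for a base such as S, where no fully
precomposed character exists, the same code leaves the acute as a trailing
combining mark.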