Editing Sinhala and Similar Scripts

Richard Wordingham richard.wordingham at ntlworld.com
Sat Mar 22 19:16:44 CDT 2014


On Sat, 22 Mar 2014 23:37:49 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-22 20:50 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > > But it won't apply to "diacritics" (combining characters or joiner
> > > controls like CGJ, ZWJ and ZWNJ, and possibly even some other
> > > format controls) that have combining class 0, because their
> > > encoding order is significant for knowing where to stop the
> > > effect of Backspace.
> >
> > Your approach recommends input methods that separate combining
> > marks of different combining classes by CGJ for easier editing!
> >
> 
> NO. I certainly do not recommend it ! This is a false assertion.

If one takes your approach to handling input, then one needs CGJ to ease
the correction of diacritics.  I am not saying that you recommend the
use of CGJ.
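
To make the point concrete, here is a minimal sketch of one reading of
the Backspace behaviour under discussion: a trailing run of marks with
non-zero combining class is removed as a unit, and a character of
combining class 0 such as CGJ stops the run.  The function name
backspace and its exact rule are assumptions made for illustration, not
anyone's actual implementation.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0

def backspace(text: str) -> str:
    """Hypothetical Backspace: remove the trailing run of marks with
    non-zero combining class as one unit; otherwise remove one code point."""
    end = len(text)
    start = end
    # Walk back over marks whose relative order is not significant.
    while start > 0 and unicodedata.combining(text[start - 1]) != 0:
        start -= 1
    if start == end:  # no trailing reorderable marks
        return text[:-1]
    return text[:start]

plain = "C\u0327\u0301"                # C, COMBINING CEDILLA, COMBINING ACUTE
with_cgj = "C\u0327" + CGJ + "\u0301"  # same marks, separated by CGJ

after_plain = backspace(plain)         # both marks go at once
after_cgj = backspace(with_cgj)        # only the acute goes
```

Under this rule the CGJ-separated form is the only one in which a single
mark can be corrected without retyping the other.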

> > > I see absolutely no reason why Backspace would arbitrarily delete
> > > only the last encoded character when users cannot even count them
> > > and may not have input them separately, or could expect them to
> > > have been typed in a different order.
> > >
> > > So yes, entering:
> > > <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> > > <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> > > <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> > > <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> > > should all result in keeping only the letter C in the backing
> > > store.
> >
> > > And with an IME supporting a Compose key this will also be true:
> >
> > > <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> > > <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> > > <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> > > <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
> >
> > Your input methods suggest that there is something unitary about the
> > result - which makes sense if their output is U+1E08 LATIN CAPITAL
> > LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments
> > if 'C' were replaced with 'S'?  There is no character LATIN CAPITAL
> > LETTER S WITH CEDILLA AND ACUTE.
> 
> I have NOT said that there existed such character (look at the
> separating commas).

I looked at the names.  Dead keys are effectively modifiers applied
beforehand rather than simultaneously, so there is no more reason for
the dead key sequences to generate more than one character than there
is for an ordinary key to generate multiple characters.

The use of 'COMPOSE' indicates that one is not simply entering a
sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
an input process different to simply 'C, COMBINING CEDILLA, COMBINING
ACUTE'.
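
The contrast between 'C' and 'S' can be checked with Python's standard
unicodedata module.  This is only a sketch of what NFC does with the
plain combining sequences; the helper name nfc and the variable names
are chosen for illustration.

```python
import unicodedata

CEDILLA, ACUTE = "\u0327", "\u0301"  # COMBINING CEDILLA, COMBINING ACUTE ACCENT

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# Both mark orders are canonically equivalent, so NFC gives one result each.
c_forms = {nfc("C" + CEDILLA + ACUTE), nfc("C" + ACUTE + CEDILLA)}
s_forms = {nfc("S" + CEDILLA + ACUTE), nfc("S" + ACUTE + CEDILLA)}

c_result = c_forms.pop()  # a single precomposed character, U+1E08
s_result = s_forms.pop()  # U+015E followed by a combining acute
```

For 'C' the result is one code point; for 'S' it remains two, because no
precomposed S with cedilla and acute exists.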

> This is a pragmatic consideration, that canonical equivalence should
> also be respected even when editing texts. The same key should produce
> canonically equivalent text when editing at the same logical position
> texts that are canonically equivalent.

That raises an interesting question.  Which positions in the string <RO
RUA, SARA UU, MAI THO> (ccc = 0, 103, 107) are logically the same
positions as which positions in the canonically equivalent string <RO
RUA, MAI THO, SARA UU>?  Are you saying that some positions are not
'logical'?  I for one would prefer to be able to access any position
within the string.  It is a shame there has been so little uptake of
the SIL Graphite split cursor approach, which attempted to address the
issue of editing clusters.
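
For reference, the combining classes and the canonical equivalence of
the two Thai strings can be confirmed as follows.  This is a small
sketch using Python's unicodedata module; the variable names are
illustrative only.

```python
import unicodedata

RO_RUA, SARA_UU, MAI_THO = "\u0e23", "\u0e39", "\u0e49"

vowel_first = RO_RUA + SARA_UU + MAI_THO
tone_first = RO_RUA + MAI_THO + SARA_UU

# Canonical combining classes of the three characters.
classes = [unicodedata.combining(ch) for ch in vowel_first]

# Canonical ordering sorts the marks by class, so both strings share one NFD.
equivalent = (unicodedata.normalize("NFD", vowel_first)
              == unicodedata.normalize("NFD", tone_first))
```

The two strings differ as code point sequences but normalise to the
same thing, which is why the question of matching cursor positions
between them arises.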

As to pragmatics, we are discussing editing with feedback.  If we have
full feedback, we do not need canonical equivalence to be respected.

> If
> an advanced IME is used to allow editing the content of a cluster
> before the cursor position, it will require a specific dialog to
> decompose the characters and render in the IME the cluster as a
> sequence of characters rendered in isolation (in "view controls mode").

It is not a good idea to tamper with the normalisation in the first
place.  The sequence of characters used may say quite a bit about how
the user thinks of the cluster.  Pragmatically, normalisation may also
degrade rendering - recall the efforts Microsoft went to to discourage
the normalisation of Korean text!
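
As a small illustration of how normalisation changes what is stored,
the sketch below decomposes a single Hangul syllable.  It shows only
the change in the code point sequence and makes no claim about how any
particular renderer treats the result; the variable names are
illustrative.

```python
import unicodedata

syllable = "\ud55c"  # one precomposed Hangul syllable

# NFD replaces the syllable with its conjoining jamo.
jamo = unicodedata.normalize("NFD", syllable)

# NFC composes the jamo back into the single syllable.
recomposed = unicodedata.normalize("NFC", jamo)
```

One stored character becomes three after NFD, so text that has been
normalised is no longer the sequence the user entered.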

> Most text editors do not support such separate IME panel and in fact
> users do not like seeing these IME popups appearing on top of the
> edited text. They want to be able to input text directly in the
> WYSIWYG window. The IME panel is an advanced edit mode which requires
> specific support in the application (and an integration similar to
> the panels used by spell checkers).

A separate IME panel is not the only approach.  Another approach is to
apply a modified font to the region around the cluster, so that the
cluster is displayed in a form suitable for editing, and to render the
rest of the WYSIWYG region according to the usual rules.

Richard.


