Editing Sinhala and Similar Scripts

Richard Wordingham richard.wordingham at ntlworld.com
Sat Mar 22 14:50:56 CDT 2014


On Sat, 22 Mar 2014 09:41:40 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-22 1:04 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
 
> So if you enter <C, CEDILLA, ACUTE> or <C, ACUTE, CEDILLA>, you get
> in the editor's backing store some encoding form (which may be
> precombined or not, or with diacritics not necessarily in the
> normalized form, and all these 4 possible encodings are canonically
> equivalent): then if you press Backspace, the effect should also not
> depend on whether you just entered these keystrokes or if you loaded
> the text and clicked after the sequence before pressing backspace:
> How can you predict which character to remove ?

If I entered those three characters in NFD order, I would expect to
remove the ACUTE.  I would be annoyed to find the string reduced to
just C, and am annoyed to find it completely deleted.
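As a rough illustration of the four canonically equivalent spellings
mentioned in the quoted text, the Python sketch below normalizes each of
them to NFD and then removes the last code point. The function name
backspace_nfd_last is hypothetical, and "normalize to NFD, then drop one
code point" is only an assumed policy for the sake of the example, not a
description of any particular editor.

```python
import unicodedata

C, CEDILLA, ACUTE = "C", "\u0327", "\u0301"

# The four canonically equivalent spellings of C + cedilla + acute.
spellings = [
    C + CEDILLA + ACUTE,   # marks already in NFD order (ccc 202, then 230)
    C + ACUTE + CEDILLA,   # marks in the opposite order
    "\u00C7" + ACUTE,      # C WITH CEDILLA + COMBINING ACUTE ACCENT
    "\u0106" + CEDILLA,    # C WITH ACUTE + COMBINING CEDILLA
]

# All four collapse to one NFD string.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}


def backspace_nfd_last(text: str) -> str:
    """Assumed policy: normalize to NFD, then drop the last code point."""
    return unicodedata.normalize("NFD", text)[:-1]


# The result is the same whichever spelling was in the backing store.
results = {backspace_nfd_last(s) for s in spellings}

print([f"U+{ord(ch):04X}" for ch in next(iter(nfd_forms))])
print([f"U+{ord(ch):04X}" for ch in next(iter(results))])
```

Under this assumed policy the outcome does not depend on which of the
four spellings was stored: in each case the ACUTE is removed and C with
CEDILLA remains.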

I do not find consistently poor service to be better than frequently
poor service.

> The rationale would be true as well for Hebrew points (most of them
> use distinct non-zero combining classes when they are used in
> sequences).

> But it won't apply to "diacritics" (combining characters or joiner
> controls like CGJ, ZWJ and ZWNJ, and possibly even some other format
> controls) that have combining class 0 because their encoding order is
> significant, so you know where to stop the effect of Backspace.

Your approach recommends input methods that separate combining
marks of different combining classes by CGJ for easier editing!
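The effect of CGJ on canonical ordering can be checked with Python's
unicodedata module. This is only a small sketch; the variable names are
illustrative. CGJ (U+034F) has combining class 0, so marks on either
side of it are not reordered by normalization and their encoded order
is preserved.

```python
import unicodedata

CGJ = "\u034F"      # COMBINING GRAPHEME JOINER
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT
CEDILLA = "\u0327"  # COMBINING CEDILLA

# Canonical combining classes of the three characters.
ccc = {
    "CGJ": unicodedata.combining(CGJ),
    "ACUTE": unicodedata.combining(ACUTE),
    "CEDILLA": unicodedata.combining(CEDILLA),
}

# Without CGJ, NFD puts the cedilla (lower class) before the acute.
without_cgj = unicodedata.normalize("NFD", "C" + ACUTE + CEDILLA)

# With CGJ between them, the typed order survives normalization.
with_cgj = unicodedata.normalize("NFD", "C" + ACUTE + CGJ + CEDILLA)

print(ccc)
print([f"U+{ord(ch):04X}" for ch in without_cgj])
print([f"U+{ord(ch):04X}" for ch in with_cgj])
```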

> I see absolutely no reason why Backspace would arbitrarily delete
> only the last encoded character when users cannot even count them and
> may not have input them separately, or could expect them to have been
> typed in a different order.
> 
> So yes, entering:
> <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> should all result in keeping only the letter C in the backing store.
 
> And with an IME supporting Compose key this will also be true:
 
> <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>

Your input methods suggest that there is something unitary about the
result - which makes sense if their output is U+1E08 LATIN CAPITAL
LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments if
'C' were replaced with 'S'?  There is no character LATIN CAPITAL
LETTER S WITH CEDILLA AND ACUTE.
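The difference between 'C' and 'S' shows up directly in NFC. The sketch
below uses only Python's standard unicodedata module; the variable names
are illustrative.

```python
import unicodedata

CEDILLA, ACUTE = "\u0327", "\u0301"

# C + cedilla + acute composes to a single precomposed character.
c_form = unicodedata.normalize("NFC", "C" + CEDILLA + ACUTE)

# S + cedilla + acute has no fully precomposed form, so NFC leaves
# S WITH CEDILLA followed by a separate combining acute.
s_form = unicodedata.normalize("NFC", "S" + CEDILLA + ACUTE)

c_names = [unicodedata.name(ch) for ch in c_form]
s_names = [unicodedata.name(ch) for ch in s_form]

print(len(c_form), c_names)
print(len(s_form), s_names)
```

So a rule phrased as "Backspace removes the last character of the NFC
form" would treat the two letters differently: one code point for C,
two for S.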

It will be distinctly unpleasant and unnatural with an input method
that allows separate input of all three characters - C,
COMBINING CEDILLA and COMBINING ACUTE - one by one.  Your suggestion
that typing THAI CHARACTER RO RUA, THAI CHARACTER SARA UU, THAI
CHARACTER MAI THO, BACKSPACE should result in just THAI CHARACTER RO RUA
is unlikely to be welcome to Thais.
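The three Backspace behaviours discussed in this thread can be compared
on the Thai sequence with a short Python sketch. All three function
names are hypothetical, and the "cluster" test is a deliberate
simplification (a base followed by characters of general category M),
not the full grapheme cluster rules of UAX #29.

```python
import unicodedata

RO_RUA = "\u0E23"   # THAI CHARACTER RO RUA
SARA_UU = "\u0E39"  # THAI CHARACTER SARA UU
MAI_THO = "\u0E49"  # THAI CHARACTER MAI THO


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me (simplified test)."""
    return unicodedata.category(ch).startswith("M")


def backspace_code_point(text: str) -> str:
    """Remove only the last encoded character."""
    return text[:-1]


def backspace_marks(text: str) -> str:
    """Remove all trailing marks, keeping the base; if there are no
    trailing marks, remove the last character."""
    i = len(text)
    while i > 0 and is_mark(text[i - 1]):
        i -= 1
    return text[:i] if i < len(text) else text[:-1]


def backspace_cluster(text: str) -> str:
    """Remove the trailing marks together with their base."""
    i = len(text)
    while i > 0 and is_mark(text[i - 1]):
        i -= 1
    return text[:max(i - 1, 0)]


typed = RO_RUA + SARA_UU + MAI_THO
for policy in (backspace_code_point, backspace_marks, backspace_cluster):
    result = policy(typed)
    print(policy.__name__, [unicodedata.name(ch) for ch in result])
```

With this input, the first policy leaves RO RUA with SARA UU, the second
leaves only RO RUA, and the third leaves nothing.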

I believe our sharply opposing opinions arise because of different
views of the clusters.  You are seeing characters that are composed of
multiple elements.  I am seeing groups of characters that, in general,
happen not to be arranged in a line of constant direction. 

> Canonical equivalence should be respected in visual editing modes.
> Deleting only the "last" encoding diacritic should only be done in
> specific non-visual editing modes (with "visible controls") and it is
> not expected that most users will like this editing mode.

For users who know what characters should be there, it makes a lot of
sense to enter a non-visual editing mode - ideally of limited scope
- when editing a previously typed cluster.

Richard.


