Editing Sinhala and Similar Scripts

Philippe Verdy verdy_p at wanadoo.fr
Sat Mar 22 03:41:40 CDT 2014


2014-03-22 1:04 GMT+01:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Thu, 20 Mar 2014 05:59:49 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>  Not all Indic diacritics have combining class 0, and Hebrew diacritics
> have non-zero combining classes.
>

Did I say something else? You have probably misread me. I wrote "distinct
and non-zero".

You forgot the word "AND", which is important: it gives the condition
under which combining characters may be reordered during normalization, so
that their relative encoding order is unpredictable (independently of the
fact that they may be precomposed).
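
To make the condition concrete, here is a small sketch using Python's standard
unicodedata module (the helper name "classes" is only illustrative). It shows
that two marks with distinct, non-zero combining classes are put into a fixed
order by normalization, whatever order they were typed in:

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA, combining class 202
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT, combining class 230

def classes(s):
    """Return the canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]

typed = "c" + ACUTE + CEDILLA                  # acute entered first
normalized = unicodedata.normalize("NFD", typed)

print(classes(typed))        # [0, 230, 202]
print(classes(normalized))   # [0, 202, 230]: the cedilla now comes first
```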

So if you enter <C, CEDILLA, ACUTE> or <C, ACUTE, CEDILLA>, you get some
encoding form in the editor's backing store (which may be precomposed or not,
with diacritics not necessarily in normalized order; all 4 of these
possible encodings are canonically equivalent). Then, if you press
Backspace, the effect should not depend on whether you just entered
these keystrokes or loaded the text and clicked after the sequence
before pressing Backspace. How can you predict which character to remove?

That is why it should delete BOTH the CEDILLA and the ACUTE here: they
have distinct, non-zero combining classes, and so are unordered.
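
The canonical equivalence of the four encodings mentioned above can be checked
with a few lines of Python (a sketch using the standard unicodedata module;
lowercase c is used purely for illustration). All four strings collapse to a
single form under NFD, and likewise under NFC:

```python
import unicodedata

forms = [
    "c\u0327\u0301",   # c, COMBINING CEDILLA, COMBINING ACUTE ACCENT
    "c\u0301\u0327",   # c, COMBINING ACUTE ACCENT, COMBINING CEDILLA
    "\u00e7\u0301",    # C WITH CEDILLA (precomposed), COMBINING ACUTE ACCENT
    "\u0107\u0327",    # C WITH ACUTE (precomposed), COMBINING CEDILLA
]

nfd = {unicodedata.normalize("NFD", s) for s in forms}
nfc = {unicodedata.normalize("NFC", s) for s in forms}

# Each set has exactly one member: the four inputs are indistinguishable
# after normalization, so nothing identifies a "last typed" diacritic.
print(len(nfd), len(nfc))
```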

The rationale holds as well for Hebrew points (most of them have
distinct non-zero combining classes when they are used in sequences).
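
The same check for a Hebrew example, again as a sketch with Python's
unicodedata module (the choice of BET with DAGESH and PATAH is mine, picked
only for illustration): the two points have distinct non-zero classes, so
normalization fixes their order regardless of how they were entered.

```python
import unicodedata

BET = "\u05d1"     # HEBREW LETTER BET, combining class 0
DAGESH = "\u05bc"  # HEBREW POINT DAGESH OR MAPIQ, combining class 21
PATAH = "\u05b7"   # HEBREW POINT PATAH, combining class 17

typed = BET + DAGESH + PATAH                   # dagesh entered first
normalized = unicodedata.normalize("NFD", typed)

# After normalization the patah (17) precedes the dagesh (21).
print([unicodedata.combining(ch) for ch in normalized])   # [0, 17, 21]
```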

But it does not apply to "diacritics" (combining characters, or joiner controls
like CGJ, ZWJ and ZWNJ, and possibly even some other format controls) that
have combining class 0, because their encoding order is significant, so you
know where to stop the effect of Backspace.
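
A short sketch of why class 0 is different, using Python's unicodedata module:
the three controls named above all have combining class 0, and a class-0
character placed between two marks prevents normalization from reordering
them, so their encoded order is preserved.

```python
import unicodedata

CGJ = "\u034f"   # COMBINING GRAPHEME JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

control_classes = [unicodedata.combining(ch) for ch in (CGJ, ZWJ, ZWNJ)]
print(control_classes)   # [0, 0, 0]

# ACUTE (230) followed by CEDILLA (202) would normally be swapped,
# but the intervening CGJ (class 0) blocks the reordering.
blocked = "c\u0301" + CGJ + "\u0327"
print(unicodedata.normalize("NFD", blocked) == blocked)   # True
```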

I see absolutely no reason why Backspace should arbitrarily delete only the
last encoded character when users cannot even count them and may not have
input them separately, or could expect them to have been typed in a different
order.

So yes, entering:
<CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
<ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
<ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
<CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
should all result in keeping only the letter C in the backing store.

And with an IME supporting a Compose key this will also be true:

<COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
<COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
<COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
<COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
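
The behaviour described by these sequences can be sketched as follows. This is
only one possible reading of the rule, not an existing editor's
implementation: the function name "backspace" is hypothetical, it operates on
the backing store rather than on keystrokes, and it assumes the result may be
left in NFD. It removes the trailing run of marks whose combining classes are
non-zero and pairwise distinct, and stops at a class-0 character or at a
repeated class, where encoding order is significant.

```python
import unicodedata

def backspace(text: str) -> str:
    """Hypothetical visual-mode Backspace (illustrative sketch only)."""
    s = unicodedata.normalize("NFD", text)   # assumption: work on the NFD form
    if not s:
        return s
    if unicodedata.combining(s[-1]) == 0:
        return s[:-1]                        # ordinary case: drop one code point
    seen = set()                             # combining classes already removed
    i = len(s)
    while i > 0:
        ccc = unicodedata.combining(s[i - 1])
        if ccc == 0 or ccc in seen:
            break                            # order is significant from here on
        seen.add(ccc)
        i -= 1
    return s[:i]

# All four backing-store encodings of c + cedilla + acute leave just "c".
for stored in ("c\u0327\u0301", "c\u0301\u0327", "\u00e7\u0301", "\u0107\u0327"):
    print(backspace(stored) == "c")          # True each time
```

Two marks of the same class (for example two acute accents) are ordered with
respect to each other, so in this sketch only the last of them is removed.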

Canonical equivalence should be respected in visual editing modes. Deleting
only the "last" encoded diacritic should be done only in specific
non-visual editing modes (with "visible controls"), and most users are not
expected to like this editing mode.