How to disable Indic syllable form editing in MS Word

Richard Wordingham via Indic indic at unicode.org
Sun Dec 10 05:56:55 CST 2017


On Sun, 10 Dec 2017 11:54:06 +0530
Shriramana Sharma via Indic <indic at unicode.org> wrote:

> I notice that in the case of some platforms/apps, for example the
> Firefox on Kubuntu 16.04 that I'm using right now, if I place the
> cursor before or after a cluster like क्ष्य and use the cursor keys,
> the visual cursor doesn't jump the cluster but traverses it
> progressively in N steps where N is one more than the number of
> viramas inside it. At each step, the logical cursor seems to be placed
> *after* a virama.
> 
> Are you saying the proposed update to UAX#29 is going to prohibit this
> behaviour? That may not be advisable. Why are they trying to do it?
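
[Illustration: a minimal Python sketch of the cursor behaviour described in
the quoted text, assuming the only interior cursor stops are the positions
immediately after each virama.  The function name cursor_stops is
hypothetical; this is not how Firefox or any particular toolkit implements
cursor movement.]

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA


def cursor_stops(cluster):
    """Return logical cursor offsets (in code points) for one cluster.

    Assumption: the cursor may rest at the start, after each virama,
    and at the end of the cluster, and nowhere else.
    """
    stops = [0]
    for index, ch in enumerate(cluster):
        if ch == VIRAMA:
            stops.append(index + 1)
    if stops[-1] != len(cluster):
        stops.append(len(cluster))
    return stops


# KA, VIRAMA, SSA, VIRAMA, YA: the cluster from the quoted message.
cluster = "\u0915\u094D\u0937\u094D\u092F"
stops = cursor_stops(cluster)
steps = len(stops) - 1  # number of key presses needed to cross the cluster
```

With two viramas in the cluster, steps comes out as 3, matching "N is one
more than the number of viramas".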

The conspiracy view is that SE Asia has not been forgiven for the USA
losing the Vietnamese War, coupled in the case of USE with an
Abrahamist attack on the script of the Dharma.  (The 'Tham' in 'Tai
Tham' means 'Dharma'.)  A tradition of wearing turbans just magnifies
the offence.  You can combine this with the declaration that U+2060 WORD
JOINER does not indicate that text on either side is part of the same
word. That declaration is a threat to spell checkers, which depend on a
highly fallible word breaking algorithm to find word boundaries in the
first place.  Foreign names can be particularly awkward.

Contact with the personalities involved suggests that it is actually the
cock-up theory that is true.

In this particular case, there are two paragraphs in UAX#29 that do
the damage:

"The Unicode Standard provides default algorithms for determining
grapheme cluster boundaries, with two variants: legacy grapheme
clusters and extended grapheme clusters. The most appropriate variant
depends on the language and operation involved. However, the extended
grapheme cluster boundaries are recommended for general processing,
while the legacy grapheme cluster boundaries are maintained primarily
for backwards compatibility with earlier versions of this
specification."

"An extended grapheme cluster is the same as a legacy
grapheme cluster, with the addition of some other characters. The
continuing characters are extended to include all spacing combining
marks, such as the spacing (but dependent) vowel signs in Indic
scripts. For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN
I. The extended grapheme clusters should be used in implementations in
preference to legacy grapheme clusters, because they provide better
results for Indic scripts such as Tamil or Devanagari in which editing
by orthographic syllable is typically preferred. For scripts such as
Thai, Lao, and certain other Southeast Asian scripts, editing by visual
unit is typically preferred, so for those scripts the behavior of
extended grapheme clusters is similar to (but not identical to) the
behavior of legacy grapheme clusters."
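
[Illustration: the "spacing combining marks" in the second paragraph are,
roughly, the characters with General_Category Mc.  The sketch below uses
Python's standard unicodedata module to look up that category.  It is only a
proxy: the standard library does not expose the Grapheme_Cluster_Break
property itself, and the helper name is_spacing_combining is hypothetical.]

```python
import unicodedata


def is_spacing_combining(ch):
    """True if ch has General_Category Mc (spacing combining mark).

    This approximates, but does not equal, Grapheme_Cluster_Break=SpacingMark.
    """
    return unicodedata.category(ch) == "Mc"


vowel_sign_i = "\u093F"   # DEVANAGARI VOWEL SIGN I, the example in UAX#29
virama = "\u094D"         # DEVANAGARI SIGN VIRAMA, a nonspacing mark
thai_sara_e = "\u0E40"    # THAI CHARACTER SARA E, a preposed vowel

categories = {
    "U+%04X" % ord(ch): unicodedata.category(ch)
    for ch in (vowel_sign_i, virama, thai_sara_e)
}
```

Here U+093F is reported as Mc, U+094D as Mn, and U+0E40 as Lo, so only the
first is affected by the "spacing combining marks" extension.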

A case history is the addition of the 'prepend' class for the Tai
vowels that are encoded in visual order rather than phonetic order.
They had gc=Lo, and had been very accessible when editing.  When the
prepend class was added, editors started to treat preposed vowel plus
consonant as indivisible units once they had been entered, and there
were howls of protest from Thailand.  The key effect of the change was
withdrawn, with the preposed vowels reverting to having the grapheme
cluster break value 'other'.  For a while, there were no characters
with gcb=prepend.


> However, I should also note that while this behaviour seems quite
> sensible for C1 conjoining cases, it won't help to insert joiners to
> request C2-conjoining forms where the ZWJ needs to be put *before* the
> virama. For instance in Kannada to get RA + post-base YA as in ರ‍್ಯ
> the sequence is ರ, ZWJ, ್, ಯ. This can only be achieved in initial
> input as I said earlier, because post-input, the cursor will only be
> placed internally *after* the virama, and putting a ZWJ there just
> breaks the cluster like ZWNJ: ರ್‍ಯ. This is because there is no
> defined behaviour for Virama + ZWJ in Kannada.
> 
> But I presume Kannadigas can live with that (though I am not one
> myself) because such sequences aren't frequently used at all. (In fact
> most common users probably aren't aware that they exist.)

It's a case of a suboptimal system being better than nothing.  One can
position the cursor before the second consonant, delete *just* the
virama, and then type <ZWJ, VIRAMA>.  At no time do you lose the
consonants.  Scripts with consonant signs are not so lucky - the
consonant signs tend to be lost if the first consonant of the cluster is
mistyped.
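
[Illustration: the edit just described, simulated on a string of code points
in Python.  It assumes an editor in which backspace removes exactly one code
point and in which the cursor can rest after the virama; backspace and
type_text are hypothetical helpers, not any editor's API.]

```python
RA, VIRAMA, YA = "\u0CB0", "\u0CCD", "\u0CAF"  # Kannada RA, VIRAMA, YA
ZWJ = "\u200D"                                 # ZERO WIDTH JOINER


def backspace(text, cursor):
    """Delete the single code point before the cursor."""
    if cursor == 0:
        return text, cursor
    return text[:cursor - 1] + text[cursor:], cursor - 1


def type_text(text, cursor, typed):
    """Insert typed characters at the cursor and advance past them."""
    return text[:cursor] + typed + text[cursor:], cursor + len(typed)


# Start from <RA, VIRAMA, YA> with the cursor before the second consonant.
text, cursor = RA + VIRAMA + YA, 2

# Delete *just* the virama; both consonants survive.
text, cursor = backspace(text, cursor)
intermediate = text

# Type <ZWJ, VIRAMA> to request the post-base form.
text, cursor = type_text(text, cursor, ZWJ + VIRAMA)

# For contrast: typing ZWJ after the virama gives <RA, VIRAMA, ZWJ, YA>,
# the sequence the quoted message says merely breaks the cluster.
broken, _ = type_text(RA + VIRAMA + YA, 2, ZWJ)
```

The final text is <RA, ZWJ, VIRAMA, YA>, and at the intermediate step the
string still holds both consonants.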

One of the joys of Emacs for Tai Tham is that it allows one to delete
and replace the first consonant of a cluster.  Northern Thai Tai Tham has
such glorious clusters as ᩉ᩠ᨶᩮᩬᩥ᩠᩵ᨿ <U+1A49 HIGH HA, U+1A60 SAKOT,
U+1A36 NA, U+1A6E SIGN E, U+1A6C SIGN OA BELOW, U+1A65 SIGN I, U+1A75
TONE-1, U+1A60 SAKOT, U+1A3F LOW YA> /nɯai/ 'to ache all over'.  (The
Siamese cognate is normally translated just as 'tired'.)  At present
that akshara is split into three grapheme clusters, composed of 2, 6
and 1 characters. (A user perception might split it into four
logically contiguous groups of 3, 3, 1 and 2 characters for onset,
vowel, tone and final consonant.) When the change goes through, this
will be just one extended grapheme cluster, and even harder to edit.
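
[Illustration: the two groupings of that akshara, laid out in Python.  The
group sizes are copied from the paragraph above, not computed by a
segmentation algorithm; split_by_sizes is a hypothetical helper.]

```python
# HIGH HA, SAKOT, NA, SIGN E, SIGN OA BELOW, SIGN I, TONE-1, SAKOT, LOW YA
akshara = "\u1A49\u1A60\u1A36\u1A6E\u1A6C\u1A65\u1A75\u1A60\u1A3F"


def split_by_sizes(text, sizes):
    """Cut text into consecutive pieces of the given lengths."""
    if sum(sizes) != len(text):
        raise ValueError("sizes must add up to the length of the text")
    pieces, start = [], 0
    for size in sizes:
        pieces.append(text[start:start + size])
        start += size
    return pieces


def code_points(piece):
    """Render a piece as a list of U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in piece]


# The present split into grapheme clusters, as stated in the message.
current = split_by_sizes(akshara, [2, 6, 1])

# A possible user perception: onset, vowel, tone, final consonant.
perceived = split_by_sizes(akshara, [3, 3, 1, 2])

for piece in perceived:
    print(code_points(piece))
```

After the change, the whole nine-character string would be the single unit.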

Richard.


