Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Richard Wordingham via Unicode unicode at unicode.org
Fri Dec 22 16:56:53 CST 2017


On Thu, 21 Dec 2017 22:04:37 -0800
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:

> > When deleting by backspace, the usual practice is to delete one
> > Unicode character for each key press.
> 
> This seems to depend on the operating system and program involved. For
> example, on OSX any native text input field (Spotlight, TextEdit,
> etc) will delete by extended grapheme cluster. Chrome also deletes by
> extended grapheme cluster.

That seems nasty, even for Thai with its consonant + vowel + tone
legacy grapheme clusters.  Or does Thai get special treatment?  iPhone
messages shows the normal (mandated?) Thai behaviour of deleting
character by character. Do you not find this mass deletion annoying
for Hindi aksharas with anusvara?
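
To make the difference concrete, here is a minimal Python sketch of the
two backspace policies.  The function names are mine, and the cluster
rule (a base character plus any trailing characters of general category
M) is only a rough stand-in for real grapheme cluster segmentation, not
what OSX or Chrome actually implement.

```python
import unicodedata

def delete_code_point(text):
    """Backspace that removes only the last code point."""
    return text[:-1]

def delete_rough_cluster(text):
    """Backspace that removes the last base character and its marks.

    Assumption: a cluster is one base plus trailing characters whose
    general category starts with 'M'.  This ignores most of UAX #29.
    """
    i = len(text)
    while i > 0 and unicodedata.category(text[i - 1]).startswith("M"):
        i -= 1
    return text[:max(i - 1, 0)]

# HA + VOWEL SIGN I + ANUSVARA: one akshara, three code points
hin = "\u0939\u093f\u0902"
after_code_point = delete_code_point(hin)    # loses only the anusvara
after_cluster = delete_rough_cluster(hin)    # loses the whole akshara
```

With the first policy a mistyped anusvara costs one key press to fix;
with the second the whole akshara has to be typed again.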

> However, Firefox deletes by code point. Or, more accurately, something
> codepoint-like. Backspace will delete flag emoji wholesale, but will
> delete the jamos in `ᄀᄀᄀ각ᆨᆨ` (a single EGC) one at a time. It
> also deletes the variation selector and the heart in `��‍❤️‍��` in a
> single keystroke. There's probably a simple metric being used here,
> but I haven't looked into it yet.

There are some odd behaviours around.  Claws-mail, which I think uses
straight GTK2, has been changing its treatment of Latin diacritics.
Long ago, if I remember correctly, it treated 'e acute' differently
depending on whether it was one or two codepoints, then it started
converting text to NFC on input, and now it treats the NFC and NFD
sequence <x, U+030C COMBINING CARON> as though it were a single
codepoint. This might be using the property 'diacritic', but it isn't
treating Thai tone marks that way, so I'm guessing.  Presumably it's
been implemented on the principle that the user should not receive any
pleasant surprises.
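
For anyone wanting to check the one-versus-two codepoint distinction,
the following Python sketch uses the standard unicodedata module.  The
sample strings are my own illustrations and say nothing about how
Claws-mail or GTK2 is actually implemented.

```python
import unicodedata

def codepoints(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]

# 'e acute' has a precomposed form, so NFC and NFD differ in length.
e_acute_nfc = unicodedata.normalize("NFC", "e\u0301")
e_acute_nfd = unicodedata.normalize("NFD", "\u00e9")

# <x, U+030C> has no precomposed form, so NFC leaves it as two code points.
x_caron_nfc = unicodedata.normalize("NFC", "x\u030c")
x_caron_nfd = unicodedata.normalize("NFD", "x\u030c")

print(codepoints(e_acute_nfc))
print(codepoints(e_acute_nfd))
print(codepoints(x_caron_nfc))
```

So an editor that deletes by codepoint after converting to NFC will
still see two codepoints for x with caron, and any single-keystroke
treatment of it must come from some other rule.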

> -----------
> Overall it seems like there's a different preference for forming
> clusters in different scripts. Perhaps we should have a specific
> "cluster forming virama" category for viramas from scripts that
> almost always prefer clusters? (e.g. devanagari). IIRC some indic
> scripts prefer explicit virama rendering.

The denial of "one size fits all" is appropriate within writing systems
as well as across all systems.  For example, using grapheme clusters as
the unit of matching may generally work well, but is a total disaster
in Indic if one needs to replace one vowel by another, as in
Hariraama's plea for help on the Indic list on 6 December.
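
A small Python sketch shows the kind of failure I mean.  It is not
Hariraama's actual case; the word, the helper names and the rough
cluster rule (base plus trailing characters of general category M) are
all my own assumptions for illustration.

```python
import unicodedata

def rough_clusters(text):
    """Split text into a base character plus trailing marks (category M*).

    A crude approximation of grapheme clusters, not UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def replace_by_cluster(text, old, new):
    """Replace only whole clusters that are exactly equal to old."""
    return "".join(new if c == old else c for c in rough_clusters(text))

word = "\u0915\u093f\u0924\u093e\u092c"   # KA, SIGN I, TA, SIGN AA, BA
SIGN_I, SIGN_II = "\u093f", "\u0940"

by_code_point = word.replace(SIGN_I, SIGN_II)           # vowel is changed
by_cluster = replace_by_cluster(word, SIGN_I, SIGN_II)  # nothing matches
```

The vowel sign never forms a cluster of its own, so matching by cluster
cannot find it, while matching by codepoint replaces it at once.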

Moving the editing position within a cluster is another issue.
Sometimes one needs to adjust the type of joining of consonants in an
akshara, e.g. to give a Devanagari text a Hindi look even if the text
gets displayed with a Sanskrit font.  This is where the font-dependent
interpretation of a virama is a disaster.  For Devanagari it might have
been better if there had been three different characters for the two
types of joining (half-forms on one hand and conjuncts or repha on the
other) and one type of non-joining, the visible virama.  Instead, it
seems that people rely on the appropriate gaps in the font capability.
The nettle was grasped for the Myanmar script, which now has an
invisible stacker, a pure vowel killer, and a composite code for the
repha-type combination.

I really do find it hard to believe that it is considered to be bad to
correct a single consonant in the middle of an akshara.

I am not persuaded that the users of languages with many
multi-consonant aksharas think that each distinct akshara is a
different character.  The akshara lies in a hierarchy, between grapheme
cluster and pada patha word.  What is needed is an extra level of
cursor motion, between the levels of word and grapheme cluster.
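
As a sketch of what that extra level might look like, the Python below
builds Devanagari aksharas on top of rough clusters by joining across a
virama.  Everything here is an assumption made for illustration: the
function names, the cluster rule (base plus trailing category-M
characters), the consonant range U+0915..U+0939, and the decision to
ignore ZWJ, ZWNJ and nukta-era extras.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def is_consonant(ch):
    """Assumption: only the basic consonants U+0915..U+0939 count."""
    return "\u0915" <= ch <= "\u0939"

def rough_clusters(text):
    """Base character plus trailing marks; the lower cursor level."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def aksharas(text):
    """Join a cluster ending in virama to a following consonant cluster."""
    units = []
    for cluster in rough_clusters(text):
        if units and units[-1].endswith(VIRAMA) and is_consonant(cluster[0]):
            units[-1] += cluster
        else:
            units.append(cluster)
    return units

word = "\u0938\u0902\u0938\u094d\u0915\u0943\u0924"  # 'Sanskrit'
cluster_stops = rough_clusters(word)   # four cursor stops
akshara_stops = aksharas(word)         # three cursor stops
```

An editor could then offer both sets of stops, with the akshara motion
sitting between word motion and cluster motion.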

Thai also shows different levels of division.  For horizontal spacing,
the unit is indeed the grapheme cluster.  However, looking at a
dictionary published in 1971, I noticed that a few marks above are
conditionally placed between the grapheme clusters.  The primary
examples are MAI HANAKAHT and MAI THO, neither of which the
Thais considered a vowel back in 1892.  (Michell's dictionary of 1892
is apologetic about treating the former as a vowel.)  I haven't noticed
this behaviour in 20th century Thai with characters separated by several
character widths, though both these marks tend to be placed in the
rightmost part of the space allocated to the base consonant.  Correct
positioning of MAI THO seems to require a grammatical analysis, and the
documentation of Uniscribe certainly used to suggest that this was not
possible at the font level.

Vertical writing in Thai is extremely rare.  There are Thai
crosswords, and they do use the grapheme clusters.  Many, but not all,
of the examples use an irregular pentagon to accommodate marks above and
a different irregular pentagon to accommodate marks below.  Thais seem
better acquainted with Scrabble played in English.

Commercial vertical signs follow a different grammar.  I have two
examples, segmented ยา-มา-หา 'Yamaha', and วิ-ดี-โอ 'video', the
latter typically accompanied by V-D-O in Roman letters. It is not clear
whether these words are split into super-extended grapheme clusters or
syllables.

When it comes to line-breaking, the Thai preference for emergency
line-breaks (which are supposed to be beyond the scope of Unicode) seems
to be for division into syllables.  This seems to be the standard for
Lao line-breaking, though it might be connected with the facts that
syllable boundaries are easier to detect with modern Lao spelling and
that there are far fewer users of Lao than of Thai.

When it comes to detecting aksharas in Tamil the situation seems to be
rather simple.  In two environments, U+0BCD TAMIL SIGN VIRAMA behaves
like an invisible stacker.  Otherwise, it behaves like a pure killer.
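
A sketch of such a test in Python follows.  I am assuming here that the
two environments are the k.ssa conjunct (KA + virama + SSA) and the
word shri (SHA or SA + virama + RA + II); that reading, the function
name and the return labels are all assumptions for illustration rather
than a statement of any specification.

```python
VIRAMA = "\u0bcd"                 # TAMIL SIGN VIRAMA
KA, SSA = "\u0b95", "\u0bb7"
SHA, SA = "\u0bb6", "\u0bb8"
RA, SIGN_II = "\u0bb0", "\u0bc0"

def virama_role(text, i):
    """Classify the virama at index i as 'stacker' or 'killer'.

    Assumed stacking contexts: KA + virama + SSA, and
    SHA/SA + virama + RA + VOWEL SIGN II.
    """
    if text[i] != VIRAMA:
        raise ValueError("no virama at this index")
    before = text[i - 1] if i > 0 else ""
    after = text[i + 1:i + 3]
    if before == KA and after[:1] == SSA:
        return "stacker"
    if before in (SHA, SA) and after == RA + SIGN_II:
        return "stacker"
    return "killer"

kssa = KA + VIRAMA + SSA
shri = SHA + VIRAMA + RA + SIGN_II
kka = KA + VIRAMA + KA
roles = [virama_role(s, 1) for s in (kssa, shri, kka)]
```

An akshara finder for Tamil would then join consonants only across a
virama classified as a stacker.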

For Malayalam, the two writing styles have different behaviours for
the virama.  Disunification of U+0D4D MALAYALAM SIGN VIRAMA is probably
not an option.  In theory, one could try demanding that ambiguous
intentionally visible virama be spelt with ZWNJ, but I doubt that such
a command would be heeded.  

For Sinhalese, it may be that there is no ambiguity in the effect of
virama, provided one is aware that the current Unicode prescription is
contrary to the rules laid down by the government of Sri Lanka.  Note
that the recent W3C investigation of Indic layout requirements was
restricted to *Indian* Indic scripts.  Does anyone here know what a
Burmese 'dropped capital' looks like?  The investigation did not cover
Insular Southeast Asia, where there are characters of Indic syllabic
category virama.  The coding of mainland Southeast Asian Indic scripts
has evolved beyond the virama, using an invisible stacker and a pure
killer instead. The ISCII stage is, so far as I am aware, restricted to
India and Sri Lanka.

Richard.


