Counting Devanagari Aksharas

Sat Apr 22 05:13:16 CDT 2017

On Fri, 21 Apr 2017 16:27:43 -0700
Manish Goregaokar via Unicode <unicode at unicode.org> wrote:

> > Do Hindi speakers really think of orthographic syllables as
> > characters?  
> 
> When rendered as a cluster, yes? I've asked around, and folks seem to
> insist on coupling it to the rendering.

That argues that it's a unit, which I don't think is in dispute.  Words
are also units, and nowadays we don't normally insist that one retype a
word just to change one bit of it.

> Given most fonts render
> *normal* (common, etc) clusters, I think making them EGCs and looking
> at nonrendered clusters the same way we do family emoji is fine
> (family emojis of length 5 are a single EGC, but that's not what's
> actually perceived by the user, but it's a use case that's very rare
> in the wild, so it doesn't matter).

That depends on the language.  In the Tai Tham script, even without
consonant clusters one can get 5 graphic characters in a syllable,
e.g. ᨧᩮᩢ᩶ᩣ _cao_ <HIGH CA, SIGN E, MAI SAT, TONE-2, SIGN AA> 'lord;
you (polite)', and when one adds consonant clusters one easily gets
monosyllables like ᨠᩖ᩠᩶ᩅ᩠ᨿ _kluai_ <HIGH KA, MEDIAL LA, SAKOT, WA,
TONE-2, SAKOT, LOW YA> 'banana' with 5 graphic characters and
additionally 2 coengs.  (One can distinguish Pali from the Tai
languages simply by the density of the ink!)
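
The counts above can be checked mechanically.  What follows is only a
minimal Python sketch: the code points are an assumed reading of the
abbreviated names in the angle brackets (U+1A60 is TAI THAM SIGN SAKOT,
the coeng), and the helper name `inventory` is made up for the
illustration.

```python
import unicodedata

SAKOT = "\u1A60"  # TAI THAM SIGN SAKOT, the invisible stacker (coeng)

# The two syllables, written as escapes in the order given in the message.
# The code points are an assumed reading of the abbreviated names.
CAO = "\u1A27\u1A6E\u1A62\u1A76\u1A63"
KLUAI = "\u1A20\u1A56\u1A60\u1A45\u1A76\u1A60\u1A3F"


def inventory(text):
    """Return (graphic, stackers): code points other than SAKOT, and SAKOTs."""
    stackers = text.count(SAKOT)
    return len(text) - stackers, stackers


for label, word in (("cao", CAO), ("kluai", KLUAI)):
    graphic, stackers = inventory(word)
    print(f"{label}: {graphic} graphic characters, {stackers} coengs")
    for ch in word:
        # Print each code point with its formal Unicode name.
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}")
```

The printed names make it easy to compare the assumed code points with
the names listed in the message.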

At present these are split into two and three grapheme clusters
respectively, and LibreOffice cursor movement responds accordingly.
(SIGN AA starts a grapheme cluster in several scripts of Further
India, i.e. mainland South-East Asia.)  However, if one teaches the
Emacs editor what a Tai Tham
syllable is, so that it can use the M17n rendering library, the cursor
then advances syllable by syllable, which is unpleasant for imperfect
typists.  Fortunately, it's possible to add functions to Emacs to allow
it to advance character by character; I forget whether one also has to
add a few code changes.  (The downside is that text either side of the cursor
is rendered independently, which can be a nuisance when editing very
long lines.)

> The way I see it, the current
> system is wrong, and so would the proposed system of not breaking at
> viramas (or not breaking at viramas followed by a consonant if we want
> to be more precise), but the proposed system would be wrong much less
> often.

> I am only talking about Devanagari, though scripts like
> Bangla/Gujarati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
> sensible.

Indeed, viramas (InSC=Virama) will have to be handled case-by-case.  One
should continue to break after pulli (U+0BCD TAMIL SIGN VIRAMA) except
for the cases of the ligatures/conjuncts.  I don't know if there are
obscure cases, or whether it's only _shri_ and <KA, SSA> for which one
should not break just because of the virama.  Continuation after coengs
(InSC=Invisible_Stacker) should be automatic.
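
For Devanagari alone, the rule proposed in the quoted text (no break
after a virama that is followed by a consonant, but a break when ZWNJ
intervenes) can be sketched in Python as below.  This is a rough
illustration under stated assumptions, not an implementation of UAX #29:
the function names are hypothetical, the consonant ranges are a
simplification, and a ZWJ after the virama is simply treated as ending
the join.

```python
import unicodedata

VIRAMA = "\u094D"              # DEVANAGARI SIGN VIRAMA
ZWNJ, ZWJ = "\u200C", "\u200D"


def is_consonant(ch):
    # Simplifying assumption: KA..HA plus the nukta letters QA..YYA.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def extends(ch):
    # Combining marks (Mn, Mc, Me) and the joiners stay with the previous cluster.
    return unicodedata.category(ch).startswith("M") or ch in (ZWNJ, ZWJ)


def clusters(text, join_after_virama=True):
    """Split text into clusters; optionally keep <virama, consonant> together."""
    out = []
    for ch in text:
        glue = extends(ch) or (
            join_after_virama
            and bool(out)
            and out[-1].endswith(VIRAMA)
            and is_consonant(ch)
        )
        if out and glue:
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # KA VIRAMA SSA TA VIRAMA RA I YA
print(len(clusters(word, join_after_virama=False)))  # breaking after every virama
print(len(clusters(word)))                           # joining virama + consonant
print(len(clusters("\u0915\u094D\u200C\u0937")))     # ZWNJ after the virama forces a break
```

The sketch is purely code-point based, so it is the kind of fallback
definition that cannot tell whether a font actually forms the conjunct.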

Malayalam will need customisation.  Definitions by codepoints are only
a fallback, for when a font cannot be used to guide the process. 

Formally, normalisation is a problem, as these characters can be
separated from letters by other marks.  This is a problem in practice
for normalised text in Tai Tham.
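
A minimal Python sketch of the effect, assuming the canonical combining
classes in the Unicode Character Database as shipped with Python.  The
four-character sequence is chosen only to show the reordering and is
not claimed to be a real spelling; `ccc_trace` is a made-up helper name.

```python
import unicodedata

WA, TONE_2, SAKOT, LOW_YA = "\u1A45", "\u1A76", "\u1A60", "\u1A3F"


def ccc_trace(text):
    """Pair each code point with its canonical combining class."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in text]


# Tone mark entered before the subjoined consonant: SAKOT is next to LOW YA.
typed = WA + TONE_2 + SAKOT + LOW_YA
normalised = unicodedata.normalize("NFC", typed)

print(ccc_trace(typed))
print(ccc_trace(normalised))

# After normalisation the lower-class SAKOT sorts ahead of the tone mark,
# so the tone mark now sits between SAKOT and the consonant it subjoins.
print(normalised.index(LOW_YA) - normalised.index(SAKOT))
```

A rule that looks for SAKOT immediately followed by a consonant
therefore has to allow for intervening marks in normalised text.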

Pure killers (InSC=Pure_Killer) should probably continue, by default, to
be given no special treatment, though I wonder whether we should
define orthographic syllables for Pali in Thai script.  The two
orthographies will need different rules, and renderers won't help.
Defining orthographic syllables for languages in the Latin script is
probably excessive.

Richard.