Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mon Dec 11 05:56:51 CST 2017

On Mon, 11 Dec 2017 08:59:20 +0100
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> The proposed rules do not distinguish the different visual forms that
> a sequence of characters surrounding a virama can have, such as
> 
>    1. an explicit virama, or
>    2. a half-form is visible, or
>    3. a ligature is created.

Do you mean 'visible virama' by an 'explicit virama'?  In the context
of the Indic syllabic category of virama (which is what I think of as
the Unicode virama), I would expect 'explicit virama' to refer to the
sequence <virama, ZWNJ>.  (In several scripts, this is encoded as a
separate character, and usually classified as a 'pure killer'.)
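
To make the terminology concrete, here is a minimal Python sketch of the
three encodings in question, using KA + SSA as the consonant pair.  The
labels in FORMS and the helper name describe() are illustrative choices,
not anything taken from the proposal, and what is actually displayed for
each sequence depends on the font and the renderer.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"      # DEVANAGARI LETTER KA, DEVANAGARI LETTER SSA
VIRAMA = "\u094D"                 # DEVANAGARI SIGN VIRAMA
ZWNJ, ZWJ = "\u200C", "\u200D"    # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER

# Labels are illustrative; they describe the form each sequence requests.
FORMS = {
    "conjunct (no joiner)": KA + VIRAMA + SSA,
    "explicit virama <virama, ZWNJ>": KA + VIRAMA + ZWNJ + SSA,
    "half-form <virama, ZWJ>": KA + VIRAMA + ZWJ + SSA,
}

def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

if __name__ == "__main__":
    for label, text in FORMS.items():
        print(label)
        for line in describe(text):
            print("   ", line)
```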

> That is following the requested structure in
> http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf.
> 
> So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct
> Forms in Devanagari
> <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G14632>)
> doesn't break a GC, nor do instances where a particular script always
> shows an explicit virama between two particular consonants.

Actually, I don't see ZWJ or ZWNJ in this document.  A literal reading
of the document would see a syllable break after an explicit half-form!

> All the
> lines on Figure 12-7. Consonant Forms in Devanagari and Oriya
> <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G59257>
> having a virama would have single GCs (that is, all but the first
> line). [That, after correcting the rules as per Manish Goregaokar's
> feedback, thanks!]

That looks like a change of intent.  In Version 11.0 Draft 1, for NFD
text in the Indian Indic blocks plus control characters, ZWNJ does stop
a GCB virama from including the next consonant in an extended grapheme
cluster.
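
The difference between the two behaviours can be sketched with a toy
segmenter.  This is not the UAX #29 rule set, draft or otherwise: it
handles only Devanagari consonants KA..HA, treats every Mn/Mc mark and
both joiners as attaching to the preceding cluster, and the flag
zwnj_blocks and all function names are hypothetical, chosen only to model
the single point at issue.

```python
import unicodedata

VIRAMA = "\u094D"                 # DEVANAGARI SIGN VIRAMA
ZWJ, ZWNJ = "\u200D", "\u200C"

def is_consonant(ch):
    # Illustrative subset: DEVANAGARI LETTER KA (U+0915) .. HA (U+0939).
    return "\u0915" <= ch <= "\u0939"

def attaches(ch):
    # Marks and joiners never start a new cluster in this toy model.
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch) in ("Mn", "Mc")

def links_next_consonant(cluster, zwnj_blocks):
    # Look past trailing joiners; ZWNJ is looked past only if it does not block.
    skippable = ZWJ if zwnj_blocks else ZWJ + ZWNJ
    return cluster.rstrip(skippable).endswith(VIRAMA)

def toy_clusters(text, zwnj_blocks):
    """Split text into toy 'aksharas' under the chosen ZWNJ behaviour."""
    clusters = []
    for ch in text:
        joins = clusters and (
            attaches(ch)
            or (is_consonant(ch)
                and links_next_consonant(clusters[-1], zwnj_blocks))
        )
        if joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

if __name__ == "__main__":
    sample = "\u0915" + VIRAMA + ZWNJ + "\u0937"   # KA, VIRAMA, ZWNJ, SSA
    print(len(toy_clusters(sample, zwnj_blocks=True)))
    print(len(toy_clusters(sample, zwnj_blocks=False)))
```

With zwnj_blocks=True the sample splits after the ZWNJ, which is the
behaviour I describe for Draft 1; with zwnj_blocks=False it stays as one
cluster, which is how I read the quoted statement that a ZWNJ doesn't
break a GC.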

> The examples in "Annexure B" of 17200-text-seg-rec.pdf
> <http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf> clearly
> include #2 and #3, but don't have any examples of #1 (as far as I can
> tell from a quick scan). It would be very useful to have explicit
> examples that included #1, and included scripts other than Devanagari
> (+swaran, others).

There aren't any examples of explicitly encoded half-forms (C1 or C2)
or explicitly encoded viramas, either.  It would be good to have
examples of visible viramas in conjunction with preposed vowels, such
as U+093F DEVANAGARI VOWEL SIGN I. From Paul Nelson's remarks many
years ago, I gather there are language-dependent variations in their
placement when the halant appears.

A bit of Sanskrit would be nice to see as well.  Hindi and Sanskrit
have different preferred shapes for several consonant clusters.  Some
Sanskrit shlokas in the Tamil script would be good, too.

Richard.