Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Mark Davis ☕️ via Unicode unicode at unicode.org
Mon Jan 22 00:34:12 CST 2018


I was looking at the feedback in http://www.unicode.org/review/pri355/, and
didn't see yours there. Could you please file your feedback there? (Nothing
on this list is tracked by the committee...)


FYI, I'm thinking now that the change should be:

GB9c: (Virama | ZWJ )   × LinkingConsonant
=>
GB9c: (Virama ViramaExtend* | ZWJ ) × LinkingConsonant

where ViramaExtend = [Extend - Virama - \p{ccc=0}]
(This is pre-partitioning.)

That is close to your formulation, but for canonical equivalence there
should be no need to allow ViramaExtend after ZWJ, because ZWJ has
ccc=0, and thus nothing reorders around it.
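
As a rough illustration of the revised GB9c (a sketch only, not the UAX #29
algorithm): the Python below uses hypothetical helper names, hard-codes
DEVANAGARI SIGN VIRAMA and KA..HA as stand-ins for GCB=Virama and
GCB=LinkingConsonant, and approximates ViramaExtend as "ccc != 0 and not
Virama", since the standard library does not expose the Extend property.
The last function shows the reordering argument: NFD can move a nukta
(ccc=7) in front of a virama (ccc=9), but nothing moves across ZWJ (ccc=0).

```python
import unicodedata

# Assumptions for illustration only: one virama, one script.
VIRAMA = {"\u094D"}   # DEVANAGARI SIGN VIRAMA (ccc=9)
ZWJ = "\u200D"        # ZERO WIDTH JOINER (ccc=0)
NUKTA = "\u093C"      # DEVANAGARI SIGN NUKTA (ccc=7)


def is_linking_consonant(ch):
    # Toy stand-in for GCB=LinkingConsonant: Devanagari KA..HA.
    return "\u0915" <= ch <= "\u0939"


def is_virama_extend(ch):
    # Approximates [Extend - Virama - \p{ccc=0}] by "ccc != 0 and not Virama".
    return unicodedata.combining(ch) != 0 and ch not in VIRAMA


def gb9c_forbids_break(text, i):
    """True if (Virama ViramaExtend* | ZWJ) x LinkingConsonant applies before text[i]."""
    if not (0 < i < len(text)) or not is_linking_consonant(text[i]):
        return False
    if text[i - 1] == ZWJ:
        return True
    # Skip back over ViramaExtend*, then require a Virama.
    j = i - 1
    while j >= 0 and is_virama_extend(text[j]):
        j -= 1
    return j >= 0 and text[j] in VIRAMA


def nfd(s):
    # Canonical decomposition, which includes canonical reordering of marks.
    return unicodedata.normalize("NFD", s)
```

With these definitions, <KA, VIRAMA, NUKTA, SSA> is kept together by the
Virama branch, while <KA, ZWJ, NUKTA, SSA> is not matched by either branch.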

Cibu also pointed out on a different thread that for Malayalam we need to
consider a couple of other forms:

... Following contexts should be allowed for requesting reformed or
traditional conjuncts as per Unicode10.0.0/ch12 page 505.  ...

/$L ZWNJ $V $L/
/$L ZWJ $V $L/

The ZWJ Virama sequence is already provided for by the combination of GB9
& GB9c. But not the ZWNJ. If we want to handle that, it would mean the
addition of something like:

GB9d: × (ZWNJ ViramaExtend* Virama)
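
One possible reading of that GB9d, sketched in Python under loud
assumptions: the function name is hypothetical, only MALAYALAM SIGN VIRAMA
is treated as Virama, and ViramaExtend is again approximated as "ccc != 0
and not Virama". It reports whether a break before a ZWNJ would be
suppressed because the ZWNJ is followed by ViramaExtend* Virama.

```python
import unicodedata

ZWNJ = "\u200C"      # ZERO WIDTH NON-JOINER
ML_VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA (assumed to be the only Virama here)


def gb9d_forbids_break(text, i):
    """True if text[i] is a ZWNJ that starts ZWNJ ViramaExtend* Virama."""
    if not (0 < i < len(text)) or text[i] != ZWNJ:
        return False
    j = i + 1
    # Skip forward over the approximate ViramaExtend class.
    while (j < len(text) and text[j] != ML_VIRAMA
           and unicodedata.combining(text[j]) != 0):
        j += 1
    return j < len(text) and text[j] == ML_VIRAMA
```

For /$L ZWNJ $V $L/ this keeps the ZWNJ attached to the preceding
consonant; GB9 and GB9c would then still be needed to carry the cluster
through the virama to the following consonant.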

Cibu also wrote:


Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences
involving legacy chillus. That is, for example, <CHILLU N, VOWEL SIGN E> is
a valid sequence (Examples in Unicode10.0.0/ch12 Table 12.36). Its legacy
equivalent would be <NA, VIRAMA, ZWJ, VOWEL SIGN E>. It might be OK to
disallow this; but, we should be mindful of this side effect.

To account for the legacy cases, the simplest approach might be to add
some characters to GCB=LinkingConsonant.

Note:
The final date for deciding exactly what to do with #29 will be in April,
so there is some more time to discuss this. But we have to have a pretty
solid proposal going into that April meeting.
The only test files that we have gotten from India so far include
Devanagari, Malayalam and Bengali. I suspect that the UTC is likely to be
conservative, and limit the GCB=Virama category to just those scripts that
we have test files for, and that look complete.




Mark

On Mon, Dec 11, 2017 at 2:16 AM, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sun, 10 Dec 2017 21:14:18 -0800
> Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
>
> > > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
> >
> > You can also explicitly request ligatureification with a ZWJ, so
> > perhaps this rule should be something like
> >
> > (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant
> >
> > -Manish
> >
> > On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
> > unicode at unicode.org> wrote:
> >
> > > 1. You make a good point about the GB9c. It should probably instead
> > > be something like:
> > >
> > > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
>
> This change is unnecessary.  If we start from Draft 1 where there are:
>
> GB9:            ×       (Extend | ZWJ | Virama)
> GB9c:   (Virama | ZWJ )         ×       LinkingConsonant
>
> If the classes used in the rules are to be disjoint, we then have to
> split Extend into something like ViramaExtend and OtherExtend to allow
> normalised (NFC/NFD) text, at which point we may as well continue to
> have rules that work without any normalisation. Informally,
>
> ViramaExtend = Extend and ccc ≠ 0.
>
> OtherExtend = Extend and ccc = 0.
>
> (We might need to put additional characters in ViramaExtend.)
>
> This gives us rules:
>
> GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama)
>
> GB9c':  (Virama | ZWJ ) ViramaExtend* × LinkingConsonant
>
> So, for a sequence <virama, ZWJ, nukta, LinkingConsonant>, GB9' gives us
>
> virama × ZWJ × nukta LinkingConsonant
>
> and GB9c' gives us
>
> virama × ZWJ × nukta × LinkingConsonant
>
> ---
> In Rule GB9c, what examples justify including ZWJ?  Are they just the C1
> half-forms?  My knowledge suggests that
>
> GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant
>
> might be more appropriate.
>
> Richard.
>
>