Ambiguity(?) in Sinhala named sequences

സിബു ‌ cibucj at gmail.com
Sun Oct 16 22:12:54 CDT 2016


Hi Martin,

Isn't this question analogous to asking whether the layout engine should
use the C1-conjoining form or the C2-conjoining form for a <C1, Virama, C2>
sequence in any Indic script? That is, whether <C1, Virama> should form a
glyph while C2 keeps its independent form, or vice versa. (Potentially
there can be more forms: a full ligature and an explicit Virama form.)
If the question you asked is equivalent, then the answer is traditionally
left to the font to decide.

BTW, even for a given C1 and C2 in a given script, a font can potentially
choose a different answer based on its purpose/character, like a font
for Malayalam traditional script vs. a font for reformed script.

regards,
Cibu

On Mon, Oct 17, 2016 at 12:15 AM, Harshula <harshula at hj.id.au> wrote:

> Hi Martin,
>
> On 15/10/16 04:07, Martin Jansche wrote:
> > For Sinhala, the following named sequences are defined (for good
> > reasons):
> >
> > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> >
> > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > write Ya for 0DBA and Ra for 0DBB.
> >
> > Note that these give rise to two potentially ambiguous codepoint
> > strings, namely
> >
> >   0DBB 0DCA 200D 0DBA
> >   0DBB 0DCA 200D 0DBB
> >
> > I'll concentrate on the first, as all arguments apply to the second one
> > analogously.
> >
> > At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible
> > parses:
> >
> >   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> >   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> >
> > First question: Does the standard give any guidance as to which one is
> > the intended parse? The section on Sinhala in the Unicode Standard is
> > silent about this. Is there a general principle I'm missing?
> >
> > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > used and is considered incorrect, suggesting that the second parse
> > (Repaya+Ya) should be the default interpretation of this sequence.
> > However, SLS 1134 does not address the potential ambiguity of this
> > sequence explicitly and the description there could be read as
> > informative, not normative.
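
For illustration only, the two parses above can be reproduced mechanically.
This is a minimal sketch, not anything defined by Unicode or SLS 1134: the
function names parses() and preferred() are hypothetical, the short names
follow Martin's abbreviations, and the filter encodes only the SLS 1134
remark that the yansaya is not used after Ra.

```python
# Named sequences and letters as listed in the quoted message.
NAMED_SEQUENCES = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}
LETTERS = {"Ra": (0x0DBB,), "Ya": (0x0DBA,)}
UNITS = {**LETTERS, **NAMED_SEQUENCES}


def parses(cps):
    """Return every segmentation of cps into the units defined above."""
    cps = tuple(cps)
    if not cps:
        return [[]]  # exactly one way to parse the empty string
    results = []
    for name, seq in UNITS.items():
        if cps[:len(seq)] == seq:
            for rest in parses(cps[len(seq):]):
                results.append([name] + rest)
    return results


def preferred(cps):
    """Drop parses in which Yansaya directly follows Ra.

    Assumption: this reads the SLS 1134 remark as a hard filter, which
    the standard itself may not intend.
    """
    return [
        p for p in parses(cps)
        if not any(a == "Ra" and b == "Yansaya" for a, b in zip(p, p[1:]))
    ]


if __name__ == "__main__":
    for seq in [(0x0DBB, 0x0DCA, 0x200D, 0x0DBA),
                (0x0DBB, 0x0DCA, 0x200D, 0x0DBB)]:
        print(" ".join(f"{cp:04X}" for cp in seq))
        print("  all parses:", parses(seq))
        print("  preferred: ", preferred(seq))
```

Under these assumptions the first sequence keeps only Repaya + Ya, while
the second sequence (0DBB 0DCA 200D 0DBB) still has two parses, because
the quoted SLS 1134 remark mentions only the yansaya.
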
>
> 1) re: 0DBB 0DCA 200D 0DBA
>
> SLS 1134 was updated in 2011 (The latest public version I could find is
> v3.41. This extract is the same in v3.6.):
> https://sourceforge.net/p/sinhala/mailman/attachment/4D957C56.5050204@cse.mrt.ac.lk/1/
>
> "1.   The yansaya is not used following the letter ර. e.g.: the spelling
> කාර‍්‍ය is incorrect."
>
> If the above is insufficient, it's best to discuss the issue with Harsha
> (CC'd) and Ruvan (CC'd).
>
> 2) re: 0DBB 0DCA 200D 0DBB
>
> Harsha & Ruvan can clarify this too.
>
> cya,
> #
>
>
> > Second question: Given that one parse of this sequence should be the
> > default, how does one represent the non-default parse?
> >
> > In most cases one can guess what the intended meaning is, but I suspect
> > this is somewhat of a gray area. In practice, trying to render these
> > problematic sequences and their neighbors in HarfBuzz with a variety of
> > fonts results in a variety of outcomes (including occasionally
> > unexpected glyph choices). If the meaning of these sequences is not well
> > defined, that would partly explain the variation across fonts.
> >
> > Am I missing something fundamental? If not, it seems this issue should
> > be called out explicitly in some part of the standard.
> >
> > Regards,
> > -- martin
>