Ambiguity(?) in Sinhala named sequences
mjansche at google.com
Mon Oct 17 09:58:13 CDT 2016
Thanks for the pointer to the 2011 version of SLS 1134. After reading that
and discussing further with Cibu, here's a tentative proposal:
* The most logical[*] interpretation of the sequence 0DBB 0DCA 200D 0DBA is
Repaya+Ya. A standard (Unicode and/or SLS) should call this out
explicitly. ([*] Logical: in other scripts, including Devanagari and
Myanmar, similar modifiers that logically precede a letter are
represented in this way, sometimes without ZWJ or with a different
character in lieu of ZWJ. This interpretation would also be compatible
with a hypothetical alternative encoding of Yansaya as a single codepoint.)
* A standard (Unicode and/or SLS) should specify how Ra+Yansaya should be
encoded. SLS 1134 points out that Ra+Yansaya is an incorrect spelling, yet
in order to make this point it has to show the glyph sequence for
Ra+Yansaya. So there is clearly some need to be able to render this, even
if it's only at this meta-linguistic level. Plus SLS 1134 is very explicit
that e.g. keyboarding should allow for letter combinations to be entered
even if they are not practically useful. One possible way of encoding
Ra+Yansaya is 0DBB 200C 0DCA 200D 0DBA, i.e. Ra ZWNJ Yansaya. This renders
as intended in HarfBuzz with NotoSansSinhala, but not with
LBhashitaComplex. If we had a clear directive regarding how Ra+Yansaya
should be represented, we could work on getting fonts updated.
* Everything about 0DBB 0DCA 200D 0DBA also applies to 0DBB 0DCA 200D 0DBB.
This is much less relevant in practice, but the same arguments about
ambiguity apply and should be resolved in the same way.
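To make the ambiguity and the proposed resolution concrete, here is a rough Python sketch. It is an illustration only, not taken from Unicode, SLS 1134, or HarfBuzz. The token table, the function names, and the use of greedy longest-match as a stand-in for "Repaya binds first" are all my own assumptions.

```python
# Codepoints discussed in this thread.
RA, YA = "\u0DBB", "\u0DBA"        # SINHALA LETTER RAYANNA / YAYANNA
AL_LAKUNA = "\u0DCA"               # SINHALA SIGN AL-LAKUNA
ZWJ, ZWNJ = "\u200D", "\u200C"

# Named sequences quoted in this thread, plus the bare letters and ZWNJ.
# This table is a hypothetical toy inventory, not a full Sinhala tokenizer.
TOKENS = {
    "Repaya": RA + AL_LAKUNA + ZWJ,
    "Yansaya": AL_LAKUNA + ZWJ + YA,
    "Rakaransaya": AL_LAKUNA + ZWJ + RA,
    "Ra": RA,
    "Ya": YA,
    "ZWNJ": ZWNJ,
}


def all_parses(text):
    """Return every segmentation of text into the tokens above."""
    if not text:
        return [[]]
    parses = []
    for name, seq in TOKENS.items():
        if text.startswith(seq):
            for rest in all_parses(text[len(seq):]):
                parses.append([name] + rest)
    return parses


def greedy_parse(text):
    """Left-to-right longest match (assumed stand-in for 'Repaya first')."""
    out, i = [], 0
    while i < len(text):
        matches = [(len(seq), name) for name, seq in TOKENS.items()
                   if text.startswith(seq, i)]
        if not matches:
            raise ValueError(f"no token matches at offset {i}")
        length, name = max(matches)
        out.append(name)
        i += length
    return out


ambiguous = RA + AL_LAKUNA + ZWJ + YA             # 0DBB 0DCA 200D 0DBA
with_zwnj = RA + ZWNJ + AL_LAKUNA + ZWJ + YA      # 0DBB 200C 0DCA 200D 0DBA

print(all_parses(ambiguous))     # both readings of the ambiguous string
print(greedy_parse(ambiguous))   # the reading proposed as the default
print(greedy_parse(with_zwnj))   # the proposed Ra ZWNJ Yansaya encoding
```

Under these assumptions the bare sequence has two segmentations, the longest-match rule selects Repaya+Ya, and the ZWNJ form has only the Ra, ZWNJ, Yansaya reading, because ZWNJ prevents Repaya from matching. How a shaping engine or font should actually treat the ZWNJ is exactly the part that needs a directive from a standard.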
On Mon, Oct 17, 2016 at 12:15 AM, Harshula <harshula at hj.id.au> wrote:
> Hi Martin,
> On 15/10/16 04:07, Martin Jansche wrote:
> > For Sinhala, the following named sequences are defined (for good
> > reasons):
> > SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> > SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> > SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> > I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> > write Ya for 0DBA and Ra for 0DBB.
> > Note that these give rise to two potentially ambiguous codepoint
> > strings, namely
> > 0DBB 0DCA 200D 0DBA
> > 0DBB 0DCA 200D 0DBB
> > I'll concentrate on the first, as all arguments apply to the second one
> > analogously.
> > At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible
> > parses:
> > 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> > 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> > First question: Does the standard give any guidance as to which one is
> > the intended parse? The section on Sinhala in the Unicode Standard is
> > silent about this. Is there a general principle I'm missing?
> > Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> > used and is considered incorrect, suggesting that the second parse
> > (Repaya+Ya) should be the default interpretation of this sequence.
> > However, SLS 1134 does not address the potential ambiguity of this
> > sequence explicitly and the description there could be read as
> > informative, not normative.
> 1) re: 0DBB 0DCA 200D 0DBA
> SLS 1134 was updated in 2011 (The latest public version I could find is
> v3.41. This extract is the same in v3.6.):
> 4D957C56.5050204 at cse.mrt.ac.lk/1/
> "1. The yansaya is not used following the letter ර. e.g.: the spelling
> කාර්ය is incorrect."
> If the above is insufficient, it's best to discuss the issue with Harsha
> (CC'd) and Ruvan (CC'd).
> 2) re: 0DBB 0DCA 200D 0DBB
> Harsha & Ruvan can clarify this too.
> > Second question: Given that one parse of this sequence should be the
> > default, how does one represent the non-default parse?
> > In most cases one can guess what the intended meaning is, but I suspect
> > this is somewhat of a gray area. In practice, trying to render these
> > problematic sequences and their neighbors in HarfBuzz with a variety of
> > fonts results in a variety of outcomes (including occasionally
> > unexpected glyph choices). If the meaning of these sequences is not well
> > defined, that would partly explain the variation across fonts.
> > Am I missing something fundamental? If not, it seems this issue should
> > be called out explicitly in some part of the standard.
> > Regards,
> > -- martin