Ambiguity(?) in Sinhala named sequences

Harshula harshula at hj.id.au
Sun Oct 16 18:15:57 CDT 2016


Hi Martin,

On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
> 
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> 
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
> 
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
> 
>   0DBB 0DCA 200D 0DBA
>   0DBB 0DCA 200D 0DBB
> 
> I'll concentrate on the first, as all arguments apply to the second one
> analogously.
> 
> At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
> 
>   0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
>   0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> 
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
> 
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.

1) re: 0DBB 0DCA 200D 0DBA

SLS 1134 was updated in 2011. The latest public version I could find is
v3.41; the extract below is the same in v3.6:
https://sourceforge.net/p/sinhala/mailman/attachment/4D957C56.5050204@cse.mrt.ac.lk/1/

"1.   The yansaya is not used following the letter ර. e.g.: the spelling
කාර‍්‍ය is incorrect."

If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).
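
To make the two readings of 0DBB 0DCA 200D 0DBA concrete, here is a rough
Python sketch that enumerates every way to split a code point string into
the three named sequences quoted above plus single code points. The
function names (parses, sls_filtered) are made up for illustration, and
two things are assumptions of this toy model rather than anything the
standard says: a ZWJ (200D) is only allowed inside a named sequence, and
the SLS 1134 sentence is applied as a hard filter that drops any parse
where Yansaya directly follows Ra.

```python
# Named sequences quoted in this thread, as tuples of code points.
NAMED_SEQUENCES = {
    "YANSAYA": (0x0DCA, 0x200D, 0x0DBA),
    "RAKAARAANSAYA": (0x0DCA, 0x200D, 0x0DBB),
    "REPAYA": (0x0DBB, 0x0DCA, 0x200D),
}

ZWJ = 0x200D
RA = "0DBB"


def parses(cps):
    """Return every split of cps into named sequences and single code points.

    Toy-model assumption: ZWJ never stands alone, it must be part of one
    of the named sequences above.
    """
    cps = tuple(cps)
    if not cps:
        return [[]]
    results = []
    # Option 1: a named sequence starts at the current position.
    for name, seq in NAMED_SEQUENCES.items():
        if cps[:len(seq)] == seq:
            for rest in parses(cps[len(seq):]):
                results.append([name] + rest)
    # Option 2: consume one code point on its own (but never a bare ZWJ).
    if cps[0] != ZWJ:
        head = "%04X" % cps[0]
        for rest in parses(cps[1:]):
            results.append([head] + rest)
    return results


def sls_filtered(candidates):
    """Drop parses in which Yansaya directly follows Ra.

    Assumption: this treats the SLS 1134 statement "the yansaya is not
    used following the letter Ra" as a normative filter.
    """
    return [
        p for p in candidates
        if not any(a == RA and b == "YANSAYA" for a, b in zip(p, p[1:]))
    ]


first = parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBA])
second = parses([0x0DBB, 0x0DCA, 0x200D, 0x0DBB])
print(first)
print(sls_filtered(first))
print(second)
```

Under those assumptions the first string has exactly the two parses Martin
listed, and the filter leaves only Repaya + Ya. The second string (ending
in 0DBB) still has two parses, since the sentence quoted from SLS 1134
only mentions the yansaya.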

2) re: 0DBB 0DCA 200D 0DBB

Harsha & Ruvan can clarify this too.

cya,
#


> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
> 
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
> 
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicitly in some part of the standard.
> 
> Regards,
> -- martin

