Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

Richard Wordingham via Unicode unicode at unicode.org
Tue May 29 15:43:52 CDT 2018


On Tue, 29 May 2018 07:27:21 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:
> > How would one know that they are misapplied?  And what if the
> > author of the text has broken your rules? Are such texts never to
> > be transcribed to pukka Unicode?  
> 
> Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
> Script=Latin) doesn't automatically make the Tamil vowel "inherit"
> the Latin script property value, nor should it.

It's the sort of process that gave us U+0310 COMBINING CANDRABINDU.
However, I see adding SE Asian dependent vowels to Latin letter x
(U+0078, Script=Latin) as rather tending to make 'x' Script=Common.
Others have disagreed quite vehemently.  I see the view that the base
character is U+00D7 MULTIPLICATION SIGN (InSC=Consonant_Placeholder) has
prevailed.  Serifed U+00D7 is quite common in manually typewritten
material; I remember it from school.  I'm not sure what script the
sequence <U+00D7, U+0EB5 LAO VOWEL SIGN II> belongs to in OpenType
layout. I ought to find out for the benefit of Tai Tham fonts.

> That said, if someone decides they want that sequence, and their text
> has "broken my rules", so be it. I'm just not going to assume anything 
> particular about that text. Note that in terms of trying to determine 
> whether such a string is (naively) alphabetic, such a sequence
> doesn't interfere with the determination. On the other hand, a
> process concerned about text runs, script assignment, validity for
> domains, or other such issues *will* be sensitive to such a boundary
> -- and should not be overruled by some generic determination that
> combining marks inherit all the properties of their base.

When it comes to script runs for rendering, such a rule feels
oppressive; it is widely unenforced.  For example, I have found that
if my font treats U+0E4A THAI CHARACTER MAI TRI as a Tai Tham
character, it will generally render satisfactorily on a Tai Tham
character.  Presumably I can now use a few examples of the same
Northern Thai syllable on the same page in a published language-teaching
book as evidence for adding its clone to the Tai Tham script.  There
should also be some examples of U+0ECA LAO TONE MAI TI on Lao Tai Tham
syllables, but I haven't found any yet.  See the chart at the end of
"Exemple d’écriture ignorée par Unicode : l’écriture tham du Laos"
http://www.laosoftware.com/download/articleTALN.pdf for an implicit
claim of existence.

> > Even without knowing exactly what is wanted, it looks to me as
> > though it isn't.  If he wants to allow <pulli, ZWNJ> as a
> > substring, which he should, then that fails because there is no
> > overlap between \p{extender} and \p{gc=Cf} or between \p{diacritic}
> > and \p{gc=Cf}.  
> 
> Yes, so if you are working with strings for Indic scripts (or for
> that matter, Arabic), you add Join_Control to the mix:
> 
> Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control
> 
> gets you a decent approximation of what is (naively) expected to fall 
> within an "alphabetic" string for most scripts.

but won't work for collatable Welsh 'Llan͏gollen'!  (There's a CGJ
between the 'n' and the 'g'.)


One also needs Join_Control for fraktur German and, to my mind,
English 'Ca‍esar'.
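
For anyone who wants to poke at that union, here is a minimal sketch in
Python. It rests on assumptions: the standard library exposes only
General_Category, so Alphabetic is approximated as gc=L* plus a
hand-copied excerpt of Other_Alphabetic, and the Diacritic, Extender and
Join_Control sets are tiny hand-copied excerpts covering only the
characters mentioned in this thread. The name is_naive_alphabetic is
hypothetical.

```python
import unicodedata

# Hand-copied excerpts of Unicode property data (assumption: just enough
# for the examples in this thread, not the full property sets).
OTHER_ALPHABETIC = {0x0BC0, 0x0EB5}    # TAMIL VOWEL SIGN II, LAO VOWEL SIGN II
DIACRITIC = {0x0BCD}                   # TAMIL SIGN VIRAMA (pulli)
EXTENDER = {0x00B7, 0x0E46, 0x0EC6}    # MIDDLE DOT, THAI MAIYAMOK, LAO KO LA
JOIN_CONTROL = {0x200C, 0x200D}        # ZWNJ, ZWJ


def is_naive_alphabetic(text):
    """True if every character falls in the (approximated) union
    Alphabetic | Diacritic | Extender | Join_Control."""
    for ch in text:
        cp = ord(ch)
        alphabetic = (unicodedata.category(ch).startswith("L")
                      or cp in OTHER_ALPHABETIC)
        if not (alphabetic or cp in DIACRITIC
                or cp in EXTENDER or cp in JOIN_CONTROL):
            return False
    return True


print(is_naive_alphabetic("\u0b95\u0bcd\u200c"))   # <ka, pulli, ZWNJ>
print(is_naive_alphabetic("Ca\u200desar"))         # ZWJ between 'a' and 'e'
print(is_naive_alphabetic("Llan\u034fgollen"))     # CGJ between 'n' and 'g'
```

The first two strings pass because of Join_Control; the Welsh one fails
because U+034F COMBINING GRAPHEME JOINER (gc=Mn) is in none of the four
sets.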

> For those following along, Alphabetic is roughly meant to cover the
> ABC, かきくけこ,... plus ideographic elements of most scripts.
> Diacritic picks up most of the applied combining marks, including
> nuktas, viramas, and tone marks. Extender picks up spacing elements
> that indicate length, reduplication, iteration, etc. And joiners are,
> well, joiners.

'Diacritic' mostly includes marks with secondary collation weight;
those with primary weights, such as Indic dependent vowels, are mopped
up in Alphabetic.  (Removing diacritics is very much not the same
as removing combining marks.)
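
A small sketch of that difference, again in Python and again under
assumptions: the DIACRITIC set is a hand-copied two-character excerpt of
the real property, and both function names are hypothetical.

```python
import unicodedata

# Hand-copied excerpt of the Diacritic property (assumption: only the
# characters used in the examples below).
DIACRITIC = {0x0301, 0x0BCD}   # COMBINING ACUTE ACCENT, TAMIL SIGN VIRAMA


def strip_combining_marks(text):
    """Drop every character whose General_Category is Mn, Mc or Me."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("M"))


def strip_diacritics(text):
    """Drop only characters in the (excerpted) Diacritic set."""
    return "".join(ch for ch in text if ord(ch) not in DIACRITIC)


kii = "\u0b95\u0bc0"   # TAMIL LETTER KA + TAMIL VOWEL SIGN II
print(strip_combining_marks(kii) == "\u0b95")   # the vowel is lost
print(strip_diacritics(kii) == kii)             # the vowel survives
print(strip_diacritics("e\u0301") == "e")       # the accent is removed
```

Stripping by gc=M* throws away the dependent vowel, which changes the
syllable; stripping by Diacritic leaves it alone.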

> If one wants finer categorization specifically for Indic scripts,
> then I would suggest turning to the Indic_Syllabic_Category property
> instead of a union of PropList.txt properties and/or some twiddling
> with General_Category values.

You'd still need to add gc=L to catch things like U+0971 DEVANAGARI SIGN
HIGH SPACING DOT (which starts syllables) and U+A8F4 DEVANAGARI SIGN
DOUBLE CANDRABINDU VIRAMA.  And you'd still miss U+0303 COMBINING TILDE
and U+0331 COMBINING MACRON BELOW from Thai script Pattani Malay - I
need to make another attempt to get them appropriate Indic syllabic
category values.

Richard.


