Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

Richard Wordingham via Unicode unicode at unicode.org
Mon May 28 07:57:26 CDT 2018


On Mon, 28 May 2018 00:57:03 +0530
SundaraRaman R via Unicode <unicode at unicode.org> wrote:

> Hi,
> 
> In languages like Ruby or Java
> (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
> functions to check if a character is alphabetic do that by looking for
> the 'Alphabetic' property (defined true if it's in one of the L
> categories, or Nl, or has 'Other_Alphabetic' property). When parsing
> Tamil text, this works out well for independent vowels and consonants
> (which are in Lo), and for most dependent signs (which are in Mc or Mn
> but have the 'Other_Alphabetic' property), but the very common pulli
> (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to
> concluding any string containing it to be non-alphabetic.
> 
> This doesn't make sense to me since the Virama “◌்” is as much of an
> alphabetic character as any of the "Dependent Vowel" characters which
> have been given the 'Other_Alphabetic' property. Is there a rationale
> behind this difference, or is it an oversight to be corrected?

There is only one character with a canonical combining class of 9 that
is included as Other_Alphabetic, namely U+0E3A THAI CHARACTER PHINTHU.
That character last had any of the other properties of viramas back in
Unicode 1.0; the characters that triggered such behaviours were
permanently removed in Unicode 1.1.
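
For anyone who wants to look at the raw data, here is a minimal sketch
using Python's standard unicodedata module.  It is only an illustration:
the module does not expose Alphabetic or Other_Alphabetic, so it shows
just the general category and canonical combining class, and the values
come from whatever Unicode version the interpreter bundles.  The helper
name describe() is my own.

```python
import unicodedata

# Code points mentioned in this thread.
SAMPLES = ["\u0b95", "\u0bbe", "\u0bcd", "\u0e3a"]


def describe(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for ch in SAMPLES:
    gc, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={gc} ccc={ccc}")
```

Both the Tamil pulli and the Thai phinthu should come out as Mn with
combining class 9, which is why the difference in Other_Alphabetic
stands out.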

There are some notable absences from the combining marks included.
Significant absences include ZWJ, ZWNJ and CGJ.

However, a non-erroneous *conformant* Unicode process cannot
always determine whether a string, given only that it is a string, is
composed only of alphabetic characters.  The answer would be 'yes' for
<U+00E7 LATIN SMALL LETTER C WITH CEDILLA> but 'no' for the canonically
equivalent <U+0063 LATIN SMALL LETTER C, U+0327 COMBINING CEDILLA>!
(U+0327 is not included as alphabetic either.)
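
The discrepancy can be reproduced with a short Python sketch.  Note the
assumption: str.isalpha() tests for general category L*, not the Unicode
Alphabetic property, so it is only a stand-in here; it happens to give
the same verdict for these two strings because U+0327 is a
nonspacing mark.  The helper name same_after_nfc() is my own.

```python
import unicodedata

precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA
decomposed = "c\u0327"   # LATIN SMALL LETTER C + COMBINING CEDILLA


def same_after_nfc(a, b):
    """True if the two strings are identical once both are normalised to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The strings are canonically equivalent...
print(same_after_nfc(precomposed, decomposed))
# ...yet a per-character letter test treats them differently.
print(precomposed.isalpha(), decomposed.isalpha())
```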

There is at least one combination of Latin letter and combining mark
that occurs in the normal orthography of a natural language and does not
have a precomposed equivalent.

I fear that the correct test for what you want is to split text into
words and check that each word begins with an alphabetic character.
That test can be made by a conformant process.  I think, but have not
checked, that the test can be simplified to:

(a) Check that the first character is alphabetic.

(b) Ignore every character with a Word_Break property of Extend or ZWJ.

(c) Check that all other characters are alphabetic.
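
Steps (a) to (c) might be sketched in Python as follows.  This is an
approximation under stated assumptions, not a conformant implementation:
the standard library exposes neither Word_Break nor Alphabetic, so I
treat any mark (Mn, Mc, Me) plus ZWNJ and ZWJ as "ignorable", and any
L* or Nl character as "alphabetic".  All function names are
hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"


def is_ignorable(ch):
    # Rough stand-in for Word_Break = Extend or ZWJ (assumption, see above).
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch in (ZWNJ, ZWJ)


def is_letter(ch):
    # Rough stand-in for the Alphabetic property: letters and letter numbers.
    gc = unicodedata.category(ch)
    return gc.startswith("L") or gc == "Nl"


def looks_alphabetic(word):
    """Apply steps (a), (b) and (c) to a single word."""
    if not word or not is_letter(word[0]):          # (a)
        return False
    rest = [ch for ch in word[1:] if not is_ignorable(ch)]   # (b)
    return all(is_letter(ch) for ch in rest)        # (c)


tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # the word "Tamil", ending in pulli
print(looks_alphabetic(tamil))
print(looks_alphabetic("c\u0327a"))
print(looks_alphabetic("abc1"))
```

With this approximation a word ending in pulli passes, as does the
decomposed c-cedilla, while a word containing a digit does not.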

Richard.


