Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

Ken Whistler via Unicode unicode at unicode.org
Tue May 29 00:02:15 CDT 2018



On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> One of the general principles is that combining marks inherit the 
> property of their base character.
>
> Normally, "inherited" should be the only property value for combining 
> marks.
>
> There have been some deviations from this over the years, for various 
> reasons, and there are some properties (such as general category) 
> where it is necessary to recognize the character as combining, but the 
> general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic, 
> combining marks should be "transparent" to such an algorithm.

Generally, good advice. But there are clear exceptions. For example, the 
enclosing combining marks for symbols are intended (basically) to make 
symbols of a sort. And many combining marks have explicit script 
assignments, so they cannot simply willy-nilly inherit the script of a 
base letter if they are misapplied, for example.

This is why I recommend simply adding the Diacritic property into the 
mix for testing a string. That is a closer approximation to the kind of 
naive "Is this string alphabetic?" question that SunaraRaman was asking 
about -- it picks up the correct subset of combining marks to union with 
the set of actual isAlphabetic characters, to produce more expected 
results. (Including, of course, the correct classification of all the 
viramas, stackers, and killers, as well as picking up all the nuktas.)
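
A minimal Python sketch of that test, under stated assumptions: the 
helper names parse_ucd_property and is_alphabetic_string are 
hypothetical, and SAMPLE is a small hand-written excerpt in the UCD 
file format, not the real data. In practice the Diacritic set would be 
read from PropList.txt and the Alphabetic set from 
DerivedCoreProperties.txt.

```python
def parse_ucd_property(lines, prop):
    """Collect the code points assigned `prop` in UCD-style property lines."""
    code_points = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        # The first field is a single code point or a "XXXX..YYYY" range.
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


def is_alphabetic_string(text, alphabetic, diacritic):
    """True if every character is in the union of Alphabetic and Diacritic."""
    allowed = alphabetic | diacritic
    return bool(text) and all(ord(ch) in allowed for ch in text)


# Illustrative excerpt only (assumption): a few Tamil entries written in
# the same layout as the UCD property files.
SAMPLE = """\
0B85..0B8A    ; Alphabetic # Lo   [6] TAMIL LETTER A..TAMIL LETTER UU
0B92..0B95    ; Alphabetic # Lo   [4] TAMIL LETTER O..TAMIL LETTER KA
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
""".splitlines()

alphabetic = parse_ucd_property(SAMPLE, "Alphabetic")
diacritic = parse_ucd_property(SAMPLE, "Diacritic")

word = "\u0B95\u0BCD"  # TAMIL LETTER KA followed by TAMIL SIGN VIRAMA
print(is_alphabetic_string(word, alphabetic, set()))      # Alphabetic alone
print(is_alphabetic_string(word, alphabetic, diacritic))  # with Diacritic
```

With only the Alphabetic set, the pulli makes the check fail; once the 
Diacritic set is unioned in, the same string passes. The same function 
would accept the Extender set merged into the second argument, if that 
is wanted as well.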

Folks, please examine the set of characters for Diacritic and for 
Extender in:

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already 
there.

--Ken

P.S. And please don't start an argument about the fact that a "virama" 
isn't really a "diacritic". We know that, too. ;-)



