Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Ken Whistler via Unicode
unicode at unicode.org
Tue May 29 00:02:15 CDT 2018
On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> One of the general principles is that combining marks inherit the
> property of their base character.
>
> Normally, "inherited" should be the only property value for combining
> marks.
>
> There have been some deviations from this over the years, for various
> reasons, and there are some properties (such as general category)
> where it is necessary to recognize the character as combining, but the
> general principle still holds.
>
> Therefore, if you are trying to see whether a string is alphabetic,
> combining marks should be "transparent" to such an algorithm.
Generally, good advice. But there are clear exceptions. For example, the
enclosing combining marks for symbols are intended (basically) to make
symbols of a sort. And many combining marks have explicit script
assignments, so they cannot simply willy-nilly inherit the script of a
base letter if they are misapplied, for example.
This is why I recommend simply adding the Diacritic property into the
mix for testing a string. That is a closer approximation to the kind of
naive "Is this string alphabetic?" question that SunaraRaman was asking
about -- it picks up the correct subset of combining marks to union with
the set of actual isAlphabetic characters, to produce more expected
results. (Including, of course, the correct classification of all the
viramas, stackers, and killers, as well as picking up all the nuktas.)
Folks, please examine the set of characters for Diacritic and for
Extender in:
http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
to see what I'm talking about. The stuff you are looking for is already
there.
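
A minimal sketch of the test described above, assuming the property data
is read from UCD-format lines. The function names and the embedded
sample are illustrative only; the sample is a tiny hand-copied excerpt
(Alphabetic lines as they appear in DerivedCoreProperties.txt, the
Diacritic line as it appears in PropList.txt), and a real check would
load the full files instead.

```python
# Tiny illustrative excerpt in UCD data-file format (an assumption for
# this sketch; load the complete files for real use).
SAMPLE_UCD_LINES = """
# Alphabetic, as listed in DerivedCoreProperties.txt
0BA3..0BA4    ; Alphabetic # Lo   [2] TAMIL LETTER NNA..TAMIL LETTER TA
0BAE..0BB9    ; Alphabetic # Lo  [12] TAMIL LETTER MA..TAMIL LETTER HA
0BBE..0BBF    ; Alphabetic # Mc   [2] TAMIL VOWEL SIGN AA..TAMIL VOWEL SIGN I
# Diacritic, as listed in PropList.txt
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
"""


def parse_property_ranges(text, wanted):
    """Collect the code points of each wanted property from UCD-format lines."""
    result = {name: set() for name in wanted}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] not in result:
            continue
        first, _, last = fields[0].partition("..")  # "XXXX" or "XXXX..YYYY"
        start = int(first, 16)
        end = int(last, 16) if last else start
        result[fields[1]].update(range(start, end + 1))
    return result


def is_alphabetic_string(s, alphabetic, diacritic=frozenset()):
    """True if every character is Alphabetic or in the extra Diacritic set."""
    return bool(s) and all(
        ord(ch) in alphabetic or ord(ch) in diacritic for ch in s
    )


props = parse_property_ranges(SAMPLE_UCD_LINES, {"Alphabetic", "Diacritic"})

# TA, MA, VOWEL SIGN I, LLLA, VIRAMA (pulli)
word = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

alphabetic_only = is_alphabetic_string(word, props["Alphabetic"])
with_diacritic = is_alphabetic_string(
    word, props["Alphabetic"], props["Diacritic"]
)
```

With Alphabetic alone the pulli makes the word fail the test; taking the
union with Diacritic lets it pass.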
--Ken
P.S. And please don't start an argument about the fact that a "virama"
isn't really a "diacritic". We know that, too. ;-)