Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
Richard Wordingham via Unicode
unicode at unicode.org
Tue May 29 02:49:52 CDT 2018
On Mon, 28 May 2018 22:02:15 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:
> On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
> > One of the general principles is that combining marks inherit the
> > property of their base character.
> >
> > Normally, "inherited" should be the only property value for
> > combining marks.
> >
> > There have been some deviations from this over the years, for
> > various reasons, and there are some properties (such as general
> > category) where it is necessary to recognize the character as
> > combining, but the general principle still holds.
> >
> > Therefore, if you are trying to see whether a string is alphabetic,
> > combining marks should be "transparent" to such an algorithm.
>
> Generally, good advice. But there are clear exceptions. For example,
> the enclosing combining marks for symbols are intended (basically) to
> make symbols of a sort. And many combining marks have explicit script
> assignments, so they cannot simply willy-nilly inherit the script of a
> base letter if they are misapplied, for example.
How would one know that they are misapplied? And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?
> This is why I recommend simply adding the Diacritic property into the
> mix for testing a string. That is a closer approximation to the kind
> of naive "Is this string alphabetic?" question that SunaraRaman was
> asking about -- it picks up the correct subset of combining marks to
> union with the set of actual isAlphabetic characters, to produce more
> expected results. (Including, of course, the correct classification
> of all the viramas, stackers, and killers, as well as picking up all
> the nuktas.)
>
> Folks, please examine the set of characters for Diacritic and for
> Extender in:
>
> http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
>
> to see what I'm talking about. The stuff you are looking for is
> already there.
Even without knowing exactly what is wanted, it looks to me as though
it isn't. If he wants to allow <pulli, ZWNJ> as a substring, which
he should, then that fails because there is no overlap between
\p{extender} and \p{gc=Cf} or between \p{diacritic} and \p{gc=Cf}. U+034F
COMBINING GRAPHEME JOINER is also missing, apparently deliberately in
the case of 'diacritic'. If one uses the definition of words in the
word break algorithm, one will end up accepting combinations of letter
plus enclosing circle or keycap. (A fix to the word break algorithm
for that would be unpleasant.)
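
To make the point above concrete, here is a minimal Python sketch of a "word-like string" test. It is only an approximation under stated assumptions: Python's standard unicodedata module does not expose the Diacritic or Extender properties from PropList.txt, so the sketch falls back on General_Category. The function name is_wordlike and the choice to accept only ZWNJ and ZWJ from gc=Cf are illustrative assumptions, not anything specified in the thread. Enclosing marks (gc=Me) are rejected so that letter plus enclosing circle or keycap does not pass.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, gc=Cf
ZWJ = "\u200d"   # ZERO WIDTH JOINER, gc=Cf
JOINERS = {ZWNJ, ZWJ}  # assumption: the only format characters we accept


def is_wordlike(text: str) -> bool:
    """Hypothetical helper: True if text is letters plus 'transparent' marks.

    Approximation only: uses General_Category, not Diacritic/Extender.
    """
    seen_letter = False
    for ch in text:
        gc = unicodedata.category(ch)
        if gc.startswith("L"):
            # Any letter (Lu, Ll, Lt, Lm, Lo) counts as a base.
            seen_letter = True
        elif gc in ("Mn", "Mc") or ch in JOINERS:
            # Non-enclosing marks (including pulli and U+034F CGJ) and
            # the joiners are allowed, but only after some letter.
            if not seen_letter:
                return False
        else:
            # Everything else fails: enclosing marks (Me), spaces,
            # digits, punctuation, other format characters.
            return False
    return seen_letter


tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil", ends in pulli
print(tamil.isalpha())                     # str.isalpha() looks only at gc=L*
print(is_wordlike(tamil))
print(is_wordlike("\u0b95\u0bcd" + ZWNJ + "\u0bb7"))  # <ka, pulli, ZWNJ, ssa>
print(is_wordlike("a\u20dd"))              # letter + COMBINING ENCLOSING CIRCLE
```

Under these assumptions the Tamil word and the <pulli, ZWNJ> sequence are accepted, while the enclosed letter is not. A string containing a space is still rejected, so the sketch does nothing for the spaced cases.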
One hopes that the requirement doesn't include accepting all single
words. Every properly spelt word containing U+0E46 THAI CHARACTER
MAIYAMOK will be rejected, as it will contain a space before the
U+0E46. (I assume there are such words; certainly there are
dictionary entries with no corresponding entries without U+0E46,
such as "ตึ้ก ๆ".) At a lesser level, even English has a very few
words with spaces in them, and there is no solution but to list them.
Richard.