Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

Richard Wordingham richard.wordingham at ntlworld.com
Thu Apr 24 16:22:35 CDT 2014


On Thu, 24 Apr 2014 19:38:54 +0000
"Whistler, Ken" <ken.whistler at sap.com> wrote:

> Yes. Grapheme_Extend characters per se do not "apply" to anything.
> They are a mixture of different General_Category types -- mostly
> combining marks, but not all. The concept of applying to a base only
> refers to combining marks proper.

> The proper use of the Grapheme_Extend property is in the context of
> the text segmentation algorithms defined in UAX #29, <snip>

A watertight definition of a grapheme cluster is probably impossible.
The precise definition of the legacy grapheme cluster is crafted so
that the process of splitting a string of characters into legacy
grapheme clusters is invariant under canonical equivalence.  The various
Indic AA vowels that are Other_Grapheme_Extend are there because they
are also the second parts of canonical decompositions of multipart
Indic vowels, most typically OO.  However, diametrically opposite
approaches were taken in the 'Myanmar' and Khmer scripts.  In the
Myanmar script, the two-part vowel symbol must be encoded as two
separate characters, as in the various Tai scripts.  In the Khmer
script, the two parts are encoded as a single vowel.  Most of the
scripts of India allow both approaches; Devanagari is the most notable
exception, and the multipart vowels there are primarily used for an
archaic style.
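
As a rough illustration of the decompositions mentioned above, the
sketch below uses Python's standard unicodedata module to list the
canonical decomposition of a few two-part vowel signs. The helper name
canonical_parts and the choice of sample characters are mine, purely
for illustration; the results reflect whatever version of the Unicode
Character Database ships with the Python in use.

```python
import unicodedata

def canonical_parts(ch):
    """Return the canonical decomposition of ch as a list of characters.

    Returns an empty list when ch has no decomposition or only a
    compatibility decomposition (those start with a <tag> field).
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        return []
    return [chr(int(field, 16)) for field in fields]

# Sample vowel signs (illustrative selection, not exhaustive):
# Bengali O, Tamil OO, Khmer OO, Devanagari O.
for ch in ("\u09cb", "\u0bcb", "\u17c4", "\u094b"):
    parts = canonical_parts(ch)
    shown = " + ".join(f"U+{ord(p):04X}" for p in parts) or "(no decomposition)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {shown}")
```

The Bengali and Tamil signs split into a left-hand part followed by an
AA vowel sign, whereas the Khmer and Devanagari signs are atomic as far
as canonical equivalence is concerned.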

Thus U+09BE BENGALI VOWEL SIGN AA is intended to 'apply to'
U+09C7 BENGALI VOWEL SIGN E, and it is only in the interests of
simplicity and consistency that <U+0995 BENGALI LETTER KA, U+09BE
BENGALI VOWEL SIGN AA> is a legacy grapheme cluster but <U+0995,
U+09C0 BENGALI VOWEL SIGN II> is not.
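
To make the point concrete, here is a deliberately simplified sketch of
legacy grapheme cluster splitting. It is not the UAX #29 algorithm: it
ignores the CR/LF, Control and Hangul rules, and it approximates
Grapheme_Extend as "General_Category Mn or Me, plus a hand-picked
subset of Other_Grapheme_Extend", because Python's unicodedata module
does not expose the property itself. The names is_extend and
legacy_clusters are hypothetical.

```python
import unicodedata

# Assumption: only the two Bengali members of Other_Grapheme_Extend are
# listed here; the real property covers many more characters.
OTHER_GRAPHEME_EXTEND_SUBSET = {"\u09be", "\u09d7"}

def is_extend(ch):
    """Rough stand-in for the Grapheme_Extend property."""
    return (unicodedata.category(ch) in ("Mn", "Me")
            or ch in OTHER_GRAPHEME_EXTEND_SUBSET)

def legacy_clusters(text):
    """Split text so that no boundary falls before an Extend character."""
    clusters = []
    for ch in text:
        if clusters and is_extend(ch):
            clusters[-1] += ch      # attach to the preceding cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

ka_aa = legacy_clusters("\u0995\u09be")   # KA + VOWEL SIGN AA
ka_ii = legacy_clusters("\u0995\u09c0")   # KA + VOWEL SIGN II

# KA + VOWEL SIGN O, precomposed (NFC) and decomposed (NFD).
nfc = "\u0995\u09cb"
nfd = unicodedata.normalize("NFD", nfc)
nfc_clusters = legacy_clusters(nfc)
nfd_clusters = legacy_clusters(nfd)

# Recomposing each NFD cluster gives back the NFC clusters, so the
# boundaries agree under canonical equivalence.
recomposed = [unicodedata.normalize("NFC", c) for c in nfd_clusters]
print(len(ka_aa), len(ka_ii), recomposed == nfc_clusters)
```

Under these assumptions, U+09BE stays attached to a preceding U+09C7 in
the decomposed string, which is what keeps the boundaries the same in
both normalization forms; the price is that it also attaches to a bare
consonant.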

Richard Ishida points out in one of his web pages that the practical
definition of a grapheme cluster may actually depend on the font.

> 
> http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
> 
> See that document for the proper use. They are relevant to the
> determination of grapheme cluster boundaries.
> 
> And by the way, it is a very bad idea to be writing a program to just
> unilaterally strip away grapheme extenders from input strings.

Thank you, Ken and Doug, for making that point. 

Richard.


