Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

Whistler, Ken ken.whistler at sap.com
Thu Apr 24 14:38:54 CDT 2014


> On 23 Apr 2014, at 22:16, Mathias Bynens <mathias at qiwi.be> wrote:
> 
> > Let’s say I’m writing a program that strips combining characters and
> > grapheme extenders from an input string.
> >
> > For combining marks, I’m looking for any non-combining marks (e.g. `a`)
> > followed by one or more combining marks (e.g. `̃`), and then I remove
> > everything but the non-combining mark (e.g. leaving only `a`). Is this a
> > correct approach?
> >
> > What should the approach be for grapheme extenders? Should the
> > program only look for `Grapheme_Base` characters followed by
> > `Grapheme_Extend` characters (which includes the code points in
> > `Other_Grapheme_Extend`)?
> 
> The email subject should have been “Do `Grapheme_Extend` characters only
> apply to `Grapheme_Base`?” — sorry for the confusion.
> 
> Does anyone know the answer?

Yes. Grapheme_Extend characters per se do not "apply" to anything.
They are a mixture of different General_Category types -- mostly combining
marks, but not all. The concept of applying to a base only refers to
combining marks proper.
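
To make the "mixture of General_Category types" concrete, here is a small
Python sketch. Python's standard library does not expose Grapheme_Extend, so
the sample code points below are an assumption on my part: a hand-picked list
of characters that have Grapheme_Extend=Yes in the Unicode Character Database,
not something read from the data files. The name `general_categories` is
hypothetical.

```python
import unicodedata

# Assumption: hand-picked code points with Grapheme_Extend=Yes, transcribed
# manually rather than loaded from the Unicode Character Database.
SAMPLES = [
    0x0301,  # COMBINING ACUTE ACCENT
    0x20DD,  # COMBINING ENCLOSING CIRCLE
    0x09BE,  # BENGALI VOWEL SIGN AA
    0x200C,  # ZERO WIDTH NON-JOINER
    0xFF9E,  # HALFWIDTH KATAKANA VOICED SOUND MARK
]

def general_categories(code_points):
    """Map each code point to its General_Category as reported by Python."""
    return {f"U+{cp:04X}": unicodedata.category(chr(cp)) for cp in code_points}

if __name__ == "__main__":
    for label, gc in general_categories(SAMPLES).items():
        print(label, gc)
```

The last two samples are not combining marks at all, which is why asking
what a grapheme extender "applies to" does not have a general answer.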

The proper use of the Grapheme_Extend property is in the context of the
text segmentation algorithms defined in UAX #29, and in particular:

http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table

See that document for the proper use. Grapheme_Extend characters are relevant
to the determination of grapheme cluster boundaries.
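
As a rough illustration of what "relevant to boundaries" means, the sketch
below applies only one idea from UAX #29: do not break before an extender.
It is not a conforming implementation. It ignores CR/LF, controls, Hangul
syllables, spacing marks and everything else in the rule table, and it
approximates Grapheme_Extend as General_Category Mn or Me, which leaves out
the code points in `Other_Grapheme_Extend`. The names `is_extend_approx` and
`rough_clusters` are hypothetical.

```python
import unicodedata

def is_extend_approx(ch):
    # Assumption: Mn and Me stand in for Grapheme_Extend; the real property
    # also includes the code points listed under Other_Grapheme_Extend.
    return unicodedata.category(ch) in ("Mn", "Me")

def rough_clusters(text):
    """Split text into clusters, never breaking before an (approximate) extender."""
    clusters = []
    for ch in text:
        if clusters and is_extend_approx(ch):
            clusters[-1] += ch      # attach the extender to the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

if __name__ == "__main__":
    # Print escaped forms so the combining marks stay visible.
    print([c.encode("unicode_escape").decode() for c in rough_clusters("a\u0303b")])
    print([c.encode("unicode_escape").decode() for c in rough_clusters("\u0301a")])
```

Note that an extender at the very start of the string simply forms its own
cluster in this sketch. Nothing in the logic asks whether the preceding
character is a `Grapheme_Base`.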

And by the way, it is a very bad idea to write a program that unilaterally
strips grapheme extenders from input strings. In particular, many dependent
vowels in Indic scripts are defined as grapheme extenders. If you strip them
away, the input string will just end up as random trash. That is very, very
different from stripping diacritics and accent marks off of Latin letters.
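
A minimal Python sketch of that difference follows. It strips nonspacing
marks (General_Category Mn) after NFD normalization, which is one common way
to remove Latin accents. That choice of technique and the helper name
`strip_nonspacing_marks` are assumptions for illustration, not something from
the original question. Mn is only a subset of Grapheme_Extend, but it is
already enough to damage Devanagari text.

```python
import unicodedata

def strip_nonspacing_marks(text):
    """Decompose to NFD, then drop every character with General_Category Mn."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

if __name__ == "__main__":
    # Latin: the accent is removed and the base letter stays readable.
    print(strip_nonspacing_marks("caf\u00e9"))

    # Devanagari: KA + VOWEL SIGN U + LA. The vowel sign is Mn, so it is
    # dropped and the first syllable changes from "ku" to "ka".
    print(strip_nonspacing_marks("\u0915\u0941\u0932"))
```

In the Latin case the result is a degraded but recognizable spelling. In the
Devanagari case the vowel itself is gone, so the output is a different
sequence of syllables from the one the user typed.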

--Ken



