Difference between 'combining characters' and 'grapheme extenders'?

Thu Feb 20 05:10:09 CST 2014

Not all grapheme extenders are "combining characters". Combining
characters are classified as such for legacy reasons (through the very weak
"general category" property), and that property is normatively stabilized.
In addition, most combining characters have a non-zero combining class, and
those values are stabilized for the purpose of normalization.
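
To make the two properties mentioned above concrete, here is a minimal
sketch, assuming Python 3 and its standard unicodedata module. The helper
names describe() and is_combining_mark() are made up for illustration. The
values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(ch):
    """Return (General_Category, Canonical_Combining_Class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

def is_combining_mark(ch):
    # "Combining character" in the General_Category sense: Mn, Mc or Me.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

# A nonspacing mark, a spacing mark, an enclosing mark, and a joiner.
samples = ["\u0301", "\u0903", "\u20DD", "\u200C"]
for ch in samples:
    category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={category} ccc={ccc} "
          f"combining_mark={is_combining_mark(ch)}")
```

Note that a combining mark may still have combining class 0 (the spacing
and enclosing marks in the sample do), so the combining class alone does
not identify combining characters.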

Grapheme extenders also include characters that are NOT combining
characters but controls (e.g. joiners). Some grapheme clusters are also more
complex in some scripts (there are extenders encoded BEFORE the base
character, and they cannot be classified as combining characters because
combining characters are always encoded AFTER a base character).
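
The mismatch between the two sets can be sketched as follows. This is only
an approximation under stated assumptions: Python's unicodedata does not
expose Grapheme_Extend, so the sketch relies on the UAX #44 derivation
(Me + Mn + Other_Grapheme_Extend) and on a hand-picked, deliberately
incomplete sample of Other_Grapheme_Extend code points. The authoritative
list is in PropList.txt; the names below are hypothetical.

```python
import unicodedata

# ASSUMPTION: a tiny, incomplete sample of Other_Grapheme_Extend.
# ZERO WIDTH NON-JOINER and the two halfwidth katakana sound marks.
OTHER_GRAPHEME_EXTEND_SAMPLE = {0x200C, 0xFF9E, 0xFF9F}

def is_combining_mark(ch):
    # General_Category M* (Mn, Mc, Me).
    return unicodedata.category(ch).startswith("M")

def is_grapheme_extend_approx(ch):
    # Approximation of Grapheme_Extend: Mn + Me + sampled exceptions.
    return (unicodedata.category(ch) in ("Mn", "Me")
            or ord(ch) in OTHER_GRAPHEME_EXTEND_SAMPLE)

for ch in ["\u0301", "\u0903", "\u200C", "\uFF9E"]:
    print(f"U+{ord(ch):04X} gc={unicodedata.category(ch)} "
          f"combining_mark={is_combining_mark(ch)} "
          f"grapheme_extend~={is_grapheme_extend_approx(ch)}")
```

In this sketch the joiner (gc=Cf) and the halfwidth sound mark (gc=Lm)
extend a grapheme without being combining marks, while the spacing mark
U+0903 (gc=Mc) is a combining mark that the approximation does not treat
as an extender.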

For legacy reasons (and roundtrip compatibility with older standards), not
all scripts are encoded following the UCS character model based on combining
characters. One example is the Thai script, which does not follow the
"logical" encoding order but instead the model used in TIS-620 and in other
standards based on it, including those for Windows and *nix/*nux.
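
The Thai case can be inspected with a short sketch, again assuming Python 3
and the standard unicodedata module (describe_sequence() is a made-up
helper name). It lists the stored order of a syllable whose vowel sign is
written, and encoded, to the left of its consonant.

```python
import unicodedata

def describe_sequence(text):
    """List (code point, name, General_Category, combining class) in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch),
             unicodedata.category(ch), unicodedata.combining(ch))
            for ch in text]

# Vowel SARA E comes first in memory, consonant KO KAI second.
syllable = "\u0E40\u0E01"
for entry in describe_sequence(syllable):
    print(entry)

# For contrast, the vowel SARA I is a nonspacing mark stored after its consonant.
for entry in describe_sequence("\u0E01\u0E34"):
    print(entry)
```

The leading vowel is reported as a letter (gc=Lo), not as a combining mark,
which is what one would expect for a character stored before its base.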

2014-02-20 11:42 GMT+01:00 Mathias Bynens <mathias at qiwi.be>:

> What is the difference between 'combining characters' (
> http://www.unicode.org/faq/char_combmark.html) and 'grapheme extenders' (
> http://www.unicode.org/reports/tr44/#Grapheme_Extend) in Unicode?
>
> They seem to do the same thing, as far as I can tell - although the set of
> grapheme extenders is larger than the set of combining characters. I'm
> clearly missing something here. Why the distinction?
>
> I've also posted this question on Stack Overflow:
> http://stackoverflow.com/q/21722729/96656