Difference between ‘combining characters’ and ‘grapheme extenders’?

Richard Wordingham richard.wordingham at ntlworld.com
Thu Feb 20 14:00:15 CST 2014


On Thu, 20 Feb 2014 11:42:01 +0100
Mathias Bynens <mathias at qiwi.be> wrote:

> What is the difference between ‘combining
> characters’ (http://www.unicode.org/faq/char_combmark.html) and
> ‘grapheme
> extenders’ (http://www.unicode.org/reports/tr44/#Grapheme_Extend) in
> Unicode?
> 
> They seem to do the same thing, as far as I can tell – although the
> set of grapheme extenders is larger than the set of combining
> characters. I’m clearly missing something here. Why the distinction?

Spacing combining marks (category Mc) are in general not grapheme
extenders.  The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
(http://www.unicode.org/reports/tr29/tr29-23.html) are invariant under
canonical equivalence.  There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
U+200C ZERO WIDTH NON-JOINER (ZWNJ)
U+200D ZERO WIDTH JOINER (ZWJ)
U+302E HANGUL SINGLE DOT TONE MARK
U+302F HANGUL DOUBLE DOT TONE MARK
U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
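
The general categories of these six characters can be checked with a
short script.  This is only a sketch, using Python's standard
unicodedata module: that module does not expose the Grapheme_Extend
property itself, and its Unicode version depends on the Python build,
so it may not match the version of the standard discussed here.  ZWNJ
and ZWJ are written as U+200C and U+200D; the names EXCEPTIONS,
is_mn_or_me and TAMIL_O_PARTS are made up for the illustration.  The
Tamil vowel sign is my own choice of example of a spacing mark that
arises from a canonical decomposition.

```python
import unicodedata

# The six code points listed above (ZWNJ = U+200C, ZWJ = U+200D).
EXCEPTIONS = ["\u200C", "\u200D", "\u302E", "\u302F", "\uFF9E", "\uFF9F"]


def is_mn_or_me(ch: str) -> bool:
    """Return True if ch is a nonspacing (Mn) or enclosing (Me) mark."""
    return unicodedata.category(ch) in ("Mn", "Me")


# Print category, canonical combining class and name for each exception.
for ch in EXCEPTIONS:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        f"ccc={unicodedata.combining(ch)}",
        unicodedata.name(ch),
    )

# Example of a spacing mark that appears in a canonical decomposition:
# U+0BCA TAMIL VOWEL SIGN O decomposes into U+0BC6 + U+0BBE, and the
# second part, U+0BBE TAMIL VOWEL SIGN AA, has category Mc.
TAMIL_O_PARTS = unicodedata.decomposition("\u0BCA").split()
print(TAMIL_O_PARTS, unicodedata.category("\u0BBE"))
```

To test the Grapheme_Extend property directly, one would need the
DerivedCoreProperties.txt data file or a library that ships it.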

I can see that it will sometimes be helpful to keep ZWNJ and ZWJ along with
the previous base character.  The fullwidth sound marks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
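
The canonical-equivalence point can be illustrated with Python's
standard unicodedata module.  This is a sketch only; the example
letters (hiragana KA and halfwidth katakana KA) are my own choice and
the variable names are hypothetical.  It shows that a base letter plus
U+3099 is canonically equivalent to a precomposed letter, whereas the
halfwidth mark U+FF9E is related to U+3099 only by a compatibility
decomposition.

```python
import unicodedata

ka = "\u304B"      # HIRAGANA LETTER KA
voiced = "\u3099"  # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
ga = "\u304C"      # HIRAGANA LETTER GA (precomposed)

# Canonical equivalence: <KA, U+3099> and GA normalize to each other,
# so a grapheme boundary between KA and U+3099 would split a sequence
# that is equivalent to a single character.
composed = unicodedata.normalize("NFC", ka + voiced)
decomposed = unicodedata.normalize("NFD", ga)

# Halfwidth analogue: HALFWIDTH KATAKANA LETTER KA + U+FF9E.
halfwidth = "\uFF76\uFF9E"
nfc_halfwidth = unicodedata.normalize("NFC", halfwidth)    # left unchanged
nfkc_halfwidth = unicodedata.normalize("NFKC", halfwidth)  # KATAKANA LETTER GA

print(composed == ga, decomposed == ka + voiced)
print(nfc_halfwidth == halfwidth, nfkc_halfwidth == "\u30AC")
```

So the halfwidth marks are not forced into the set by canonical
equivalence; including them keeps halfwidth text behaving like its
fullwidth counterpart.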

I don't actually see the logic for including U+302E and U+302F.  If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.

Richard.


