Combining Characters

Martin J. Dürst duerst at it.aoyama.ac.jp
Sun Dec 14 17:54:33 CST 2025


Hello everybody,

On 2025-12-15 05:25, Don Hosek via Unicode wrote:

> Just one additional note on this: Everything around combining characters,
> normalization and grapheme segmentation is data-driven. Other than when new
> rules for Indic scripts were introduced with Unicode 15.1.0, the only thing
> I’ve needed to update for my Unicode grapheme library has been to import
> the newest Unicode data tables. I’ve not written normalization code (yet),
> but from everything that I’ve seen on that front, it looks like a similar
> thing where again, everything is data-driven.

That's essentially true, based on my experience with Unicode-related 
code for the programming language Ruby.


> The only case I can see where things could get weird would be if there
> suddenly became some weird case where, e.g., the Jovians insisted that the
> combining backslash must appear before the letter and not after it (and
> it’s been a few years since I had to really look at the rules and this
> might be possible with the existing combining character classes anyway).

Because of the way we have optimized normalization in Ruby (caching 
normalization results for runs of a base character followed by 
modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0.
See the "Normalization Behavior" entry at 
https://www.unicode.org/versions/Unicode16.0.0/#Migration.

New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung 
Khema) contained combining marks that had combining class 0 and were 
also base characters combining with other combining marks (or even with 
themselves). That was something we hadn't taken into account in our 
implementation previously (because it was not needed).

You can see an example at 
https://github.com/ruby/ruby/blob/master/test/test_unicode_normalize.rb#L219:
     assert_equal "\u{16121 16121 16121 16121 16121 1611E}",
              "\u{1611E 16121 16121 16121 16121 16121}".unicode_normalize
U+1611E is GURUNG KHEMA VOWEL SIGN AA, a single bar on top of a 
character. It combines with itself to form
U+16121, GURUNG KHEMA VOWEL SIGN U, which is a double bar above.

Although not required for actually writing Gurung Khema (or so I 
assume), the composed (NFC) form for a run of bars above (11 in the 
test code above) first groups them into pairs, each written as U+16121, 
and only for an odd number adds a single U+1611E at the end.
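
The pairing rule just described can be sketched in a few lines of Ruby. 
This is only an illustration, not code from Ruby's normalization 
implementation: the helper names composed_bars and bar_count and the 
two constants are made up for this example, and the sketch only handles 
strings consisting of these two vowel signs.

```ruby
# Hypothetical helpers for illustration; not part of Ruby or its tests.
GURUNG_KHEMA_SINGLE_BAR = "\u{1611E}" # GURUNG KHEMA VOWEL SIGN AA
GURUNG_KHEMA_DOUBLE_BAR = "\u{16121}" # GURUNG KHEMA VOWEL SIGN U

# Build the composed sequence for `count` bars: as many double bars as
# possible, followed by one single bar if the count is odd.
def composed_bars(count)
  raise ArgumentError, "count must be non-negative" if count.negative?

  pairs, remainder = count.divmod(2)
  GURUNG_KHEMA_DOUBLE_BAR * pairs + GURUNG_KHEMA_SINGLE_BAR * remainder
end

# Count the bars in a string, treating a double bar as two and ignoring
# any other character.
def bar_count(string)
  string.each_char.sum do |char|
    case char
    when GURUNG_KHEMA_SINGLE_BAR then 1
    when GURUNG_KHEMA_DOUBLE_BAR then 2
    else 0
    end
  end
end

# The input from the test quoted above: one single bar, then five doubles.
input = "\u{1611E 16121 16121 16121 16121 16121}"
regrouped = composed_bars(bar_count(input))
```

Here bar_count(input) is 11, and regrouped is the same string that the 
quoted test expects from unicode_normalize. Calling unicode_normalize 
itself only gives that result on a Ruby whose Unicode data is at 
version 16.0.0 or later, which is why the sketch does not rely on it.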

Regards,   Martin.

