Combining Characters
Martin J. Dürst
duerst at it.aoyama.ac.jp
Sun Dec 14 17:54:33 CST 2025
Hello everybody,
On 2025-12-15 05:25, Don Hosek via Unicode wrote:
> Just one additional note on this: Everything around combining characters,
> normalization and grapheme segmentation is data-driven. Other than when new
> rules for Indic scripts were introduced with Unicode 15.1.0, the only thing
> I’ve needed to update for my Unicode grapheme library has been to import
> the newest Unicode data tables. I’ve not written normalization code (yet),
> but from everything that I’ve seen on that front, it looks like a similar
> thing where again, everything is data-driven.
That's essentially true, based on my experience with Unicode-related
code for the programming language Ruby.
> The only case I can see where things could get weird would be if there
> suddenly became some weird case where, e.g., the Jovians insisted that the
> combining backslash must appear before the letter and not after it (and
> it’s been a few years since I had to really look at the rules and this
> might be possible with the existing combining character classes anyway).
Because of the way we have optimized normalization in Ruby (caching
normalization results for runs of a base character followed by
modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0.
See the "Normalization Behavior" entry at
https://www.unicode.org/versions/Unicode16.0.0/#Migration.
New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung
Khema) contain combining marks that have combining class 0 and also act
as base characters that combine with other combining marks (or even with
themselves). Our implementation had not accounted for that previously,
because it had not been needed.
You can see an example at
https://github.com/ruby/ruby/blob/master/test/test_unicode_normalize.rb#L219:
assert_equal "\u{16121 16121 16121 16121 16121 1611E}",
"\u{1611E 16121 16121 16121 16121 16121}".unicode_normalize
U+1611E is GURUNG KHEMA VOWEL SIGN AA, a single bar on top of a
character. It combines with itself to form
U+16121, GURUNG KHEMA VOWEL SIGN U, which is a double bar above.
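To illustrate why such a run ends up as pairs followed by a leftover
single bar, here is a deliberately simplified sketch of the composition
step. It is not Ruby's actual implementation: the method name and the
one-entry table are hypothetical, the input is assumed to be already
fully decomposed, and every code point is assumed to have combining
class 0, so a character can only compose with the one immediately
before it.

```ruby
# Hypothetical one-entry composition table (illustration only):
# VOWEL SIGN AA + VOWEL SIGN AA => VOWEL SIGN U
SKETCH_COMPOSITION = { [0x1611E, 0x1611E] => 0x16121 }.freeze

# Simplified composition pass. Assumes all code points have combining
# class 0, so only directly adjacent characters may compose.
def compose_starters(codepoints)
  codepoints.each_with_object([]) do |cp, out|
    composed = out.empty? ? nil : SKETCH_COMPOSITION[[out.last, cp]]
    if composed
      out[-1] = composed # merge into the preceding character
    else
      out << cp          # no pair found; cp starts a new unit
    end
  end
end
```

Feeding this eleven copies of 0x1611E yields five 0x16121 followed by
one 0x1611E, because a finished double bar has no table entry with a
following single bar and therefore the next single bar starts a new pair.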
Although not required for actually writing Gurung Khema (or so I
assume), the correct normalized form for a number of bars above (11 in
the test code above) is to first group them into pairs, each written as
U+16121, and only in the case of an odd number add a single U+1611E at
the end.
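The pairing rule just described can be sketched as a small helper. This
is only an illustration: `bars_above` and the two constants are
hypothetical names, not part of Ruby or its test suite, and the helper
builds the string directly rather than calling `unicode_normalize`.

```ruby
# Hypothetical constants for illustration.
SINGLE_BAR = "\u{1611E}" # GURUNG KHEMA VOWEL SIGN AA
DOUBLE_BAR = "\u{16121}" # GURUNG KHEMA VOWEL SIGN U

# Build the grouped form for `count` bars above: as many double bars
# as fit, then one single bar if `count` is odd.
def bars_above(count)
  raise ArgumentError, "count must be non-negative" if count.negative?

  pairs, leftover = count.divmod(2)
  DOUBLE_BAR * pairs + SINGLE_BAR * leftover
end
```

For example, `bars_above(11)` gives five U+16121 followed by one
U+1611E, the same string the test above expects from normalization.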
Regards, Martin.