U+0F81 Canonical Combining Class?

Tue Jul 29 15:02:33 CDT 2025

On the surface, this does seem confusing: it could be taken to imply an existing problem with normalization, that is, a sequence that should be equivalent to its NFD form but that would have marks in a different order in the NFD form.

However, with respect to that apparent contradiction, it's important to keep in mind that canonical combining classes are used in conjunction with Unicode normalization, and that all defined normalization forms begin with decomposition followed by canonical ordering of marks.

So, for instance, consider a character sequence < 0F81, 0F84 >. The canonical combining class of 0F81 is 0, implying that nothing would re-order around that character in canonical ordering. Compare that with the equivalent decomposed sequence (using the decomposition mapping for 0F81), < 0F71, 0F80, 0F84 >. The canonical combining classes of those characters, in sequence, are < 129, 130, 9 >, and so you might expect those would canonically reorder in the order 9 < 129 < 130, hence a sequence < 0F84, 0F71, 0F80 >. Yet the sequence < 0F84, 0F71, 0F80 > appears not to be equivalent to the original sequence < 0F81, 0F84 >, in which nothing was re-ordered.

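The fallacy in that reasoning is the step of considering canonical ordering of the non-decomposed sequence < 0F81, 0F84 >. Canonical reordering is only ever intended to be done on fully decomposed sequences. Once 0F81 has been decomposed, both starting sequences reorder to the same result, < 0F84, 0F71, 0F80 >, so they are canonically equivalent after all.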
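For anyone who wants to check this, here is a minimal sketch using Python's standard unicodedata module. The variable and helper names are just for illustration, and the result depends on the Unicode data bundled with your Python build, though these Tibetan characters have been stable for a long time.

```python
import unicodedata

def hexes(s):
    """Format each code point as four or more hex digits."""
    return [f"{ord(ch):04X}" for ch in s]

def ccc_list(s):
    """Canonical combining class of each character in the string."""
    return [unicodedata.combining(ch) for ch in s]

composed = "\u0F81\u0F84"          # < 0F81, 0F84 >
decomposed = "\u0F71\u0F80\u0F84"  # < 0F71, 0F80, 0F84 >

# NFD decomposes first, then applies canonical ordering.
nfd_composed = unicodedata.normalize("NFD", composed)
nfd_decomposed = unicodedata.normalize("NFD", decomposed)

print(hexes(composed), ccc_list(composed))
print(hexes(nfd_composed), ccc_list(nfd_composed))
print(nfd_composed == nfd_decomposed)
```

The ccc of 0F81 appears only in the first printed line, for the non-decomposed input; by the time reordering happens, 0F81 is no longer in the string.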

That explains why there isn't any contradiction regarding normalization.

But now to get to your question: isn't it a discrepancy to have a mark with ccc=0 decompose to a sequence of marks with ccc > 0? 

The only potential discrepancy that would matter would be if there were a problem with normalization. That's because canonical combining classes only have relevance in relation to normalization. And I've explained above why there isn't any such issue.

So, with that in mind... Every combining mark must be assigned some canonical combining class. In this case, we're considering a mark that's a precomposed form for a sequence of marks with different combining classes, 129 and 130. If 0F81 were assigned ccc = 129, that would seem strange (and you or someone else would eventually ask for an explanation). Likewise if 0F81 were assigned ccc = 130. The likely reason why 0F81 was assigned to class 0 is that it needed to be assigned to _some_ class and class 0 was the least strange choice.

Note that 0F81 could have been assigned to _any_ canonical combining class and it would not have had any effect on normalization: The canonical combining class of a combining mark with a canonical decomposition mapping is never used! Only the ccc for characters in the fully decomposed sequence matters. Even so, I think it's fair to say that ccc = 0 is the least strange assignment for 0F81.

Likewise for 0F73 and 0F75.
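To illustrate that the ccc of these three characters never enters into it, here is a rough sketch of decomposition followed by canonical ordering, written by hand so that the ccc lookup can be overridden. The names toy_nfd, full_decompose and ccc_override are hypothetical, and this is a simplification rather than a conformant implementation: it ignores algorithmic Hangul decomposition, which Python's unicodedata.decomposition() does not report.

```python
import unicodedata

def full_decompose(ch):
    """Recursively apply canonical decomposition mappings (no Hangul handling)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):  # empty, or a compatibility mapping
        return ch
    return "".join(full_decompose(chr(int(cp, 16))) for cp in mapping.split())

def toy_nfd(text, ccc_override=None):
    """Decompose, then stably sort each run of non-starters by ccc."""
    ccc_override = ccc_override or {}

    def ccc(ch):
        # Hypothetical hook: pretend a character has a different ccc.
        return ccc_override.get(ch, unicodedata.combining(ch))

    out, run = [], []
    for ch in "".join(full_decompose(c) for c in text):
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

# KA + precomposed vowel sign + HALANTA, for each of the three characters.
samples = ["\u0F40" + mark + "\u0F84" for mark in "\u0F73\u0F75\u0F81"]
for s in samples:
    pretend = {s[1]: 129}  # pretend the precomposed mark had ccc = 129
    print(toy_nfd(s) == toy_nfd(s, pretend) == unicodedata.normalize("NFD", s))
```

Whatever value is substituted for the precomposed mark, the output is unchanged, because that character has already been replaced by its decomposition before any ccc is consulted.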

Peter

-----Original Message-----
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Diego Frias via Unicode
Sent: July 29, 2025 10:48 AM
To: unicode at corp.unicode.org
Subject: U+0F81 Canonical Combining Class?

The Tibetan Unicode block contains a number of characters (U+0F73, U+0F75, U+0F81) that have a canonical combining class value of zero, and have non-empty decomposition mappings. This is not out of the ordinary, but upon inspecting the code points that they map to, I found that the canonical combining class of each decomposition code point is greater than zero.

In the case of U+0F81, the decomposition mapping is: U+0F71 U+0F80. Both U+0F71 and U+0F80 have canonical combining class values greater than zero, so U+0F81 decomposes solely into combining marks, yet has a canonical combining class value of zero.

What is the reasoning behind this discrepancy? It is my understanding that U+0F81 (TIBETAN VOWEL SIGN REVERSED II, ཱྀ) is supposed to be a combining mark. Also, the Tibetan block is the only block that contains code points with this behavior. It is likely that I'm misunderstanding the semantics of the canonical combining class system.

Diego Frias