Combining characters

Sun Dec 14 23:42:13 CST 2025

On 12/14/2025 9:03 PM, Doug Ewell wrote:
> Asmus Freytag wrote:
>
>> It would be correct for a sequence of a base character with _single_
>> combining mark, but as soon as you have two or more combining marks,
>> their order is defined by NFC.
> I had mistakenly assumed that Phil’s use case considered only sequences with a single combining mark, and consciously chose to limit my response to that scenario.
>
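To illustrate the quoted point about mark order, here is a minimal
Python sketch using the standard unicodedata module. The base letter
and the marks are chosen only for illustration; they are assumptions
and do not come from Phil's use case. It shows that two marks with
different canonical combining classes end up in one fixed order under
NFC, while two marks of the same class keep the order they were typed
in.

```python
import unicodedata

# Illustrative choice: "q" has no precomposed forms with these marks,
# so NFC can only reorder the marks, not compose them.
DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
ACUTE     = "\u0301"   # COMBINING ACUTE ACCENT, combining class 230
DIAERESIS = "\u0308"   # COMBINING DIAERESIS, combining class 230

# Different combining classes: both input orders normalize to one sequence.
nfc_a = unicodedata.normalize("NFC", "q" + DOT_ABOVE + DOT_BELOW)
nfc_b = unicodedata.normalize("NFC", "q" + DOT_BELOW + DOT_ABOVE)
same_after_nfc = nfc_a == nfc_b

# Same combining class: the two orders stay distinct after NFC.
nfc_c = unicodedata.normalize("NFC", "q" + ACUTE + DIAERESIS)
nfc_d = unicodedata.normalize("NFC", "q" + DIAERESIS + ACUTE)
distinct_after_nfc = nfc_c != nfc_d

print(same_after_nfc, distinct_after_nfc)
```

In the first pair the lower combining class (dot below) sorts first, so
a comparison that assumes input order would miss the equivalence. In
the second pair the order is significant and is preserved.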
I know that you were aware of the general case. What I was trying to 
communicate (and expounded upon in the other reply) is the degree to 
which human writing in the general case is highly complex, usually even 
more complex than most native speakers (other than typesetters) are ever 
aware of, even for their own language.

And it is acknowledging this complexity, and how it is necessarily 
reflected in anything that aims to be a universal system of character 
encoding, that drives the understanding that such a system must be full 
of complexities of its own that cannot be reduced to even a minimally 
simple system.

For those of us who, unlike the questioner, have been around this effort 
for any length of time, this complexity can seem to be a given. But many 
people who have not worked in this space are genuinely surprised and 
challenged by it. And that includes people who have impressive 
credentials in technical work. Without realizing it, they apply their 
own native understanding of writing systems as if it were exhaustive or 
even typical. When they try to come up with solutions, such as 
protocols, that need to be robust in the face of the full variety of 
global text (even only the living subset), they may reach conclusions 
that fatefully fall well short of what is needed, or they try to 
"simplify" away complexities that to them feel ill-motivated.

Commonly, I also observe that solutions are proposed that "micro-manage" 
some well-understood or familiar subset of characters, but leave a 
protocol without meaningful solutions or safeguards for the vast 
majority, which contains all the other scripts and writing systems.

There's no quick fix, but it is my firm conviction that we always need 
to start from a point of correctly scoping the issues as those belonging 
to a "universal" system of character encoding, as opposed to one that is 
optimized for some subset.

A./