Combining characters
Asmus Freytag
asmusf at ix.netcom.com
Sun Dec 14 23:42:13 CST 2025
On 12/14/2025 9:03 PM, Doug Ewell wrote:
> Asmus Freytag wrote:
>
>> It would be correct for a sequence of a base character with _single_
>> combining mark, but as soon as you have two or more combining marks,
>> their order is defined by NFC.
> I had mistakenly assumed that Phil’s use case considered only sequences with a single combining mark, and consciously chose to limit my response to that scenario.
>
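For anyone following along, the quoted point can be checked with a small
sketch using Python's standard unicodedata module. The helper name nfc
and the choice of "q" with these three marks are only illustrative
assumptions, not anything from the thread.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC form of text (hypothetical helper for this sketch)."""
    return unicodedata.normalize("NFC", text)


ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# "q" is chosen as the base because, as far as I know, it has no
# precomposed accented forms, so NFC only reorders the marks here.
different_class_1 = nfc("q" + ACUTE + DOT_BELOW)
different_class_2 = nfc("q" + DOT_BELOW + ACUTE)

same_class_1 = nfc("q" + ACUTE + DIAERESIS)
same_class_2 = nfc("q" + DIAERESIS + ACUTE)

print(different_class_1 == different_class_2)  # True
print(same_class_1 == same_class_2)            # False
```

Marks with different combining classes end up in one canonical order
regardless of how they were typed, while marks of the same class keep
their typed order and remain distinct sequences.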
I know that you were aware of the general case. What I was trying to
communicate (and expounded upon in the other reply) is the degree to
which human writing in the general case is highly complex, usually even
more complex than most native speakers (other than typesetters) are ever
aware of, even for their own language.

Acknowledging this complexity, and how it is necessarily reflected in
anything that aims to be a universal system of character encoding, is
what drives the understanding that such a system must be full of
complexities of its own, and that these cannot be reduced to even a
minimally simple system.

For those of us who, unlike the questioner, have been around this effort
for any length of time, this complexity can seem to be a given. But many
people who have not worked in this space are genuinely surprised and
challenged by it. And that includes people who have impressive
credentials in technical work. Without realizing it, they apply their
own native understanding of writing systems as if it were exhaustive or
even typical. When they try to come up with solutions, such as
protocols, that need to be robust in the face of the full variety of
global text (even just the living subset), they may reach conclusions
that fatefully fall well short of what is needed, or they try to
"simplify" away complexities that to them feel ill-motivated.

I also commonly observe proposed solutions that "micro-manage" some
well-understood or familiar subset of characters, but leave a protocol
without meaningful solutions or safeguards for the vast majority, which
includes all the other scripts and writing systems.

There's no quick fix, but it is my firm conviction that we always need
to start by correctly scoping the issues as those belonging to a
"universal" system of character encoding, as opposed to one that is
optimized for some subset.
A./