Combining characters

Asmus Freytag asmusf at ix.netcom.com
Sun Dec 14 16:44:49 CST 2025


On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote:
>
> Well, I’m sorta “asking for a friend” – a coworker who is deep in the 
> weeds of working with something Unicode-related. I’m blaming him for 
> having told me that :)
>
>
This actually deserves a deeper answer, or a more "bird's-eye" one, if 
you want. Read to the end.

The way you asked the question hints that you and your friend are 
conflating the concepts of "combining mark" and "diacritic". That would 
not be surprising if you are mainly familiar with European scripts and 
languages, because in that case the equivalence more or less applies.

And you may also be thinking mainly of languages and their 
orthographies, and not of notations, phonetic or otherwise, that give 
rise to unusual combinations. Most European languages do have a 
reasonably small, fixed set of letters with diacritics in their 
orthographies, even though there are many languages where, if you ask 
the native users to list all the combinations, they will fall short. 
An example is the use of an accent on the letter 'e' in some of the 
Scandinavian languages to distinguish two identically spelled small 
words that have very different functions in the syntax. You will see 
that accent used in books and formal writing, but I doubt people bother 
when writing a text message.

The focus on code space is a red herring to a degree. The real 
difficulty would be in cataloging all of the rare combinations and 
getting all fonts to be aware of them. It is much easier to encode the 
diacritic as a combining character and have general rules for layout. 
With modern fonts, you can, in principle, get acceptable display even 
for unexpected combinations, without the effort of first cataloging, 
then publishing, and then having every font vendor explicitly add an 
implementation for each combination before it can be used.
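To see what this looks like at the code point level, here is a minimal sketch in Python using only the standard unicodedata module (the choice of 'q' plus a tilde is just an illustrative pair for which, to my knowledge, no precomposed character exists):

```python
import unicodedata

# 'q' + COMBINING TILDE: a valid combining sequence even though
# Unicode encodes no precomposed "q with tilde" character.
seq = "q\u0303"
print(len(seq))                                  # 2 code points, one glyph
# NFC leaves it alone: there is nothing to compose it into.
print(unicodedata.normalize("NFC", seq) == seq)  # True

# 'e' + COMBINING ACUTE, by contrast, has a precomposed form (U+00E9).
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```

The point is that the font's layout rules, not a catalog of encoded combinations, decide whether the first sequence displays acceptably.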

Other languages and scripts have combinatorics as part of their DNA, so 
to speak. Their structural unit is not the letter (with or without 
decorations) but the syllable, which is naturally combined from 
components that graphically attach to each other or even fuse into a 
combined shape. Because that process is not random, it's easier to 
encode these structural elements (some of which are combining 
characters) than to try to enumerate the possible combinations. It 
doesn't hurt that the components nicely map onto discrete keys on the 
respective keyboards.
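One concrete case of this (chosen here as an illustration; the same principle applies to other syllabic scripts) is Korean Hangul, where Unicode encodes both the structural components, the conjoining jamo, and the precomposed syllables, with an algorithmic mapping between the two:

```python
import unicodedata

# Three conjoining jamo: HIEUH (initial) + A (vowel) + NIEUN (final).
jamo = "\u1112\u1161\u11ab"

# Canonical composition fuses them into one precomposed syllable, HAN.
syllable = unicodedata.normalize("NFC", jamo)
print(syllable == "\ud55c")                              # True: 한, U+D55C
# Decomposition recovers the original components.
print(unicodedata.normalize("NFD", syllable) == jamo)    # True
```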

Notations, such as scientific notation, also often assign a discrete 
identity to the combining mark. A dot above can mean the first 
derivative with respect to time, and it can be applied to any letter 
designating a variable, which is, at a minimum, any letter from the 
Latin or Greek alphabets. But why stop there? Nothing in the notation 
itself prevents a scientist from combining that dot with any character 
they find suitable. The only sensible solution is encoding a combining 
mark, even though some letters that have a dot above as part of an 
orthography are also encoded in precomposed form.
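A small Python sketch of exactly that situation: one and the same COMBINING DOT ABOVE (U+0307) serves Newton's derivative notation on an arbitrary variable letter, and also an orthographic letter that happens to have a precomposed counterpart:

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

# Newton's notation: "v-dot", the first derivative of v. There is no
# precomposed "v with dot above", so the sequence stays decomposed.
v_dot = "v" + DOT_ABOVE
print(unicodedata.normalize("NFC", v_dot) == v_dot)               # True

# Lithuanian orthography: 'e' with dot above IS encoded precomposed
# (U+0117), so NFC composes the identical combining sequence into it.
print(unicodedata.normalize("NFC", "e" + DOT_ABOVE) == "\u0117")  # True
```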

In contrast, Chinese ideographs, while visually composed of identifiable 
elements, are treated by their users as units. Well before Unicode came 
along, there was an established approach to managing things like 
keyboard entry while encoding these as precomposed entities rather than 
as their building blocks.

A big part of the encoding decision is always to do what makes sense for 
the writing system or notation (and the script it is based on).

For a universal encoding, such as Unicode, there simply isn't a 
"one-size-fits-all" solution that would work. But if you look at this 
universal encoding only from a very narrow perspective of the 
orthographies that you are most familiar with, then, understandably, you 
might feel that anything that isn't directly required (from your point 
of view) is an unnecessary complication.

However, once you adopt a more universal perspective, it's much easier 
to not rat-hole on some seeming inconsistencies, because you can always 
discover how certain decisions relate to the specific requirements for 
one or more writing systems. Importantly, this often includes 
requirements based on de facto implementations for these systems before 
the advent of Unicode. Being universal, Unicode needed to be designed to 
allow easy conversion from all existing data sets. For European scripts, 
the business community and the librarians had competing systems, one 
with a limited set of precomposed characters and one with combining 
marks for diacritics. The duality ultimately stems from there; the two 
communities had different goals. One wanted to handle the common case 
efficiently (primarily mapping all the modern national typewriters into 
character encoding), while the other was interested in a full 
representation of anything that could appear in printed book titles (for 
cataloging), including unusual or historic combinations.
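The practical upshot of that duality is canonical equivalence: the same word imported from a "business" encoding (precomposed) and from a "library" encoding (decomposed) yields different code point sequences that normalize into one another. A quick Python illustration:

```python
import unicodedata

business = "caf\u00e9"   # precomposed é, as in Latin-1-derived data
library = "cafe\u0301"   # base 'e' + combining acute, library-style

print(business == library)                                # False
print(unicodedata.normalize("NFC", library) == business)  # True
print(unicodedata.normalize("NFD", business) == library)  # True
```

Normalization is what lets both legacy conventions coexist in one universal encoding without the texts becoming incomparable.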

In conclusion, the question isn't a bad one, but the real answer is that 
complexity is very much part of human writing, and when you design (and 
extend) a universal character encoding, you will need to be able to 
represent that full degree of complexity. Therefore, what seem like 
obvious simplifications really aren't feasible, unless you give up on 
attempting to be universal.

A./