Combining characters

Asmus Freytag asmusf at ix.netcom.com
Sun Dec 14 17:55:21 CST 2025


On 12/14/2025 3:22 PM, Mark E. Shoulson via Unicode wrote:
>
> On 12/14/25 5:44 PM, Asmus Freytag via Unicode wrote:
>
>> On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote:
>>>
>>> Well, I’m sorta “asking for a friend” – a coworker who is deep in 
>>> the weeds of working with something Unicode-related. I’m blaming him 
>>> for having told me that :)
>>>
>>>
>> This actually deserves a deeper answer, or a more "bird's-eye" one, 
>> if you want. Read to the end.
>>
>> The way you asked the question seems to hint that in your minds you 
>> and your friend conflate the concept of "combining" mark and 
>> "diacritic". That would not be surprising if you are mainly familiar 
>> with European scripts and languages, because in that case, this 
>> equivalence kind of applies.
>>
> Yes.  This is crucial.  You (Phil) are writing like "sheez, so there's 
> e and there's e-with-an-acute, we might as well just treat them like 
> separate letters."  And that maybe makes sense for languages where 
> "combining characters" are maybe two or three diacritics that can live 
> on five or six letters.  Maybe it does make sense to consider those 
> combinations as distinct letters (indeed, some of the languages in 
> question do just that.)  But some combining characters are more 
> rightly perceived as things separate from the letters which are 
> written in the same space (and have historically always been 
> considered so). The most obvious examples would be Hebrew and Arabic 
> vowel-points.  Does it really make sense to consider בְ and בֶ and בְּ 
> and all the other combinatorics as separate distinct things, when they 
> clearly contain separate units, each of which has its own consistent 
> character?  Throw in the Hebrew "accents" (cantillation marks) and 
> you're talking an enormous combinatorial explosion at the *cost* of 
> simplicity and consistency, not improving it.  Ditto Indic vowel-marks 
> and a jillion other abjads and abugidas.
>
Nice examples to back up what I wrote.
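
To make the "separate units" point concrete, here is a minimal Python
sketch, assuming only the standard unicodedata module. The helper name
describe() is made up for illustration. It lists the code points behind
a pointed Hebrew bet and shows that the points carry a non-zero
canonical combining class while the base letter does not.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, combining class) per code point.

    describe() is a hypothetical helper, not a library function.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

# BET followed by the points SHEVA and DAGESH, written with escapes so
# the example does not depend on how the mail client renders the marks.
syllable = "\u05D1\u05B0\u05BC"

rows = describe(syllable)
for row in rows:
    print(row)

# The base letter has combining class 0; each point has a non-zero class.
marks = [row for row in rows if row[2] != 0]
```

One letter plus two independent marks is three code points here; encoding
every letter/point combination as its own character would multiply the
repertoire instead.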
>
>  If anything, there's a better case to be made that the precomposed 
> letters were maybe a wrong move.
>
>
That "might" have been the case, had Unicode been created in a vacuum.

Instead, Unicode needed to offer the easiest migration path from the 
installed base of pre-existing character encodings, or risk failing to 
gain ground at all.

All the early systems mainly started out with legacy applications and 
legacy data that needed to be supported as transparently as possible. 
Given the pervasive amount of indexing into strings and length 
calculations that are deeply embedded into these legacy applications, 
trying to support these with a different encoding model (not just with a 
different encoding) would have been a non-starter.
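
As a rough illustration of why indexing and length calculations matter,
the following Python sketch (again assuming only the standard
unicodedata module; the variable names are mine) compares a precomposed
string with its canonically decomposed form. Code that assumes one unit
per letter gets different answers for what a reader sees as the same
text.

```python
import unicodedata

# "café" with a precomposed U+00E9, as a legacy Latin-1 conversion
# would typically produce it.
word_nfc = "caf\u00E9"

# The same text, canonically decomposed: e followed by U+0301.
word_nfd = unicodedata.normalize("NFD", word_nfc)

print(len(word_nfc), len(word_nfd))

# Indexing the "fourth letter" yields different things in each form.
fourth_nfc = word_nfc[3]   # the precomposed letter
fourth_nfd = word_nfd[3]   # only the base letter, without its accent

# Both forms are canonically equivalent: normalizing back restores
# the precomposed string.
round_trip = unicodedata.normalize("NFC", word_nfd)
```

With precomposed characters available, a legacy application that counts
or indexes letters keeps working on converted data without changes.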

As we've seen since, the final key in that puzzle was the IETF creating
an ASCII-compatible, variable-length encoding form that violated one of
Unicode's early design goals (to have a fixed number of code units per
character). However, allowing direct parsing of data streams for
ASCII-based syntax characters was more of a compatibility requirement
than had appeared at first.
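
A small sketch of what "direct parsing for ASCII-based syntax
characters" means in practice, assuming the encoding form in question is
UTF-8 and using a made-up path string as sample data: every byte of a
multi-byte UTF-8 sequence has its high bit set, so a byte-oriented
parser can split on an ASCII delimiter without decoding anything first.

```python
# Hypothetical sample data: a path mixing ASCII syntax characters with
# non-ASCII text (pointed Hebrew bet, and a precomposed e-acute).
path = "/home/\u05D1\u05B0/caf\u00E9.txt"
data = path.encode("utf-8")

# Split on the ASCII slash at the byte level, as a legacy parser would.
byte_parts = data.split(b"/")

# Decoding each piece afterwards gives the same result as splitting
# the decoded text, because 0x2F never occurs inside a multi-byte
# sequence.
decoded_parts = [part.decode("utf-8") for part in byte_parts]

# The price is a variable number of code units per character.
print(len(path), len(data))
```

The same byte-level split is not safe for encodings in which a byte
value below 0x80 can appear as part of a multi-byte character.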

The reason this was not built directly into the earliest Unicode
versions was that it is something that (transport) protocol designers
are up against more than people worried about representing text in
documents.

Looking at Unicode from the perspective of "what if I could design
something from scratch?" can be intellectually interesting but is of
little practical value. Any design that would have prevented people from
different legacy environments from coalescing around it would simply
have died out.

If it amuses you, you could think of some features of Unicode as being
akin to the "vestigial" organs that evolution sometimes leaves behind.
They may not strictly be required for the way the organism functions
today, but without their use in the historical transition, the current
form of the organism would not exist, because the species would be
extinct.

A./
