Unicode characters unification

Asmus Freytag via Unicode unicode at unicode.org
Mon May 28 23:40:49 CDT 2018

In the discussion leading up to this it has been implied that Unicode 
encodes / should encode concepts or pure shape. And there's been some 
confusion as to where concerns about sorting or legacy encodings fit in. 
Time to step back a bit:

Primarily the Unicode Standard encodes by character identity - something 
that is different from either the pure shape or the "concept denoted by 
the character".

For example, for most alphabetic characters, you could say that they 
stand for a more-or-less well-defined phonetic value. But Unicode does 
not encode such values directly, instead it encodes letters - which in 
turn get re-purposed for different sound values in each writing system.

Likewise, the various uses of period or comma are not separately encoded; 
instead, these marks can be given mappings to specific functions by each 
writing system or notation that uses them.

Clearly these are not encoded to represent a single mapping to an 
external concept, and, as we will see, they are not necessarily encoded 
directly by shape.

Instead, the Unicode Standard encodes character identity; but there are 
a number of principled and some ad-hoc deviations from a purist 
implementation of that approach.

The first one is that of forcing a disunification by script. What 
constitutes a script can be argued over, especially as they all seem to 
have evolved from (or been created based on) predecessor scripts, so 
there are always pairs of scripts that have a lot in common. While an 
"Alpha" and an "A" do have much in common, it is best to recognize that 
their membership in different scripts leads to important differences so 
that it's not a stretch to say that they no longer share the same identity.
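To make the Alpha/A case concrete, a small sketch using Python's standard 
unicodedata module shows that the two look-alikes are distinct characters 
with script-specific identities:

```python
import unicodedata

# Latin capital A and Greek capital Alpha share an uppercase shape but
# are separate characters with distinct code points and names.
latin_a = "\u0041"   # LATIN CAPITAL LETTER A
alpha = "\u0391"     # GREEK CAPITAL LETTER ALPHA

print(unicodedata.name(latin_a))  # LATIN CAPITAL LETTER A
print(unicodedata.name(alpha))    # GREEK CAPITAL LETTER ALPHA

# Each lowercases within its own script:
print(latin_a.lower())  # a
print(alpha.lower())    # α (GREEK SMALL LETTER ALPHA, U+03B1)
```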

The next principled deviation is that of requiring case pairs to be 
unique. Bicameral scripts (and some of the characters in them) acquired 
their lowercase forms at different times, so the relation between 
uppercase and lowercase letters differs across scripts, and this gives 
rise to some exceptional cases inside certain scripts.

This is one of the reasons to disunify certain bicameral scripts. But 
even inside scripts, there are case pairs that may share lowercase forms 
or may share uppercase forms, but said forms are disunified to make the 
pairs separate. The first two principles match users' expectations in 
that case changes (largely) work as expected in plain text and that 
sorting also (largely) matches user expectations by default.
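One way to see how casing and script membership interact (a quick sketch 
using Python's default, locale-independent case mappings): the Cyrillic 
small letter а is visually identical to the Latin a, yet each belongs to 
a case pair within its own script, which is one reason the look-alikes 
are encoded separately:

```python
# Cyrillic small "а" (U+0430) looks identical to Latin "a" (U+0061),
# but each uppercases to the capital letter of its own script.
latin_a = "\u0061"      # LATIN SMALL LETTER A
cyrillic_a = "\u0430"   # CYRILLIC SMALL LETTER A

print(latin_a.upper())     # "A" (U+0041, Latin)
print(cyrillic_a.upper())  # "А" (U+0410, Cyrillic)
print(latin_a.upper() == cyrillic_a.upper())  # False
```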

The third principle is to disunify characters based on line-breaking or 
line-layout properties. Implicit in that is the idea that plain text, 
and not markup, is the place to influence basic algorithms such as 
line-breaking and bidi layout (hence two sets of Arabic-Indic digits). 
One can argue with that decision, but the fact is that there are too many 
places where text exists without the ability to apply markup to go 
entirely without that support.
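The digit example can be checked directly: the two sets of Arabic-Indic 
digits carry different Bidi_Class property values, which is what makes 
them behave differently in bidirectional layout. A short sketch with 
Python's unicodedata:

```python
import unicodedata

# U+0660..U+0669 (Arabic-Indic digits) have Bidi_Class "AN"
# (Arabic Number), while U+06F0..U+06F9 (Extended Arabic-Indic
# digits, used e.g. for Persian) have Bidi_Class "EN" (European
# Number), so the two sets lay out differently in bidi text.
print(unicodedata.bidirectional("\u0660"))  # AN
print(unicodedata.bidirectional("\u06F0"))  # EN
```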

The fourth principle is that of differential variability of appearance. 
For letters proper, their identity can be associated with a wide range 
of appearances from sparse to fanciful glyphs. If an entire piece of 
text (or even a given word) is set using a particular font style, 
context will enable the reader to identify the underlying letter, even 
if the shape is almost unrelated to the "archetypical shape" documented 
in the Standard.

When letters or marks get re-used in notational systems, though, the 
permissible range of variability changes dramatically - variations that 
do not change the meaning of a word in styled text, suddenly change the 
meaning of text in a certain notational system. Hence the disunification 
of certain letters or marks (but not all of them) in support of 
mathematical notation.
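A concrete instance of this disunification, sketched in Python: the 
mathematical italic capital A is encoded separately from Latin A because 
in mathematical notation the style change carries meaning, and its 
compatibility decomposition records the underlying letter:

```python
import unicodedata

# U+1D434 MATHEMATICAL ITALIC CAPITAL A is disunified from U+0041;
# its compatibility decomposition points back to the base letter.
math_italic_a = "\U0001D434"

print(unicodedata.name(math_italic_a))           # MATHEMATICAL ITALIC CAPITAL A
print(unicodedata.decomposition(math_italic_a))  # <font> 0041
print(unicodedata.normalize("NFKC", math_italic_a))  # A
```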

The fifth principle appears to be to disunify only as far as, and only 
when, necessary. The biggest downside of this principle is that it leads 
to "late" disunifications: some characters get disunified only as the 
committee becomes aware of some issue, leading to the problem of legacy 
data. But it has, usefully, somewhat limited the further proliferation of 
characters of identical appearance.

The final principle is compatibility. This covers being able to 
round-trip from certain legacy encodings. This principle may force some 
disunifications that otherwise might not have happened, but it also 
isn't a panacea: there are legacy encodings that are mutually 
incompatible, so that one needs to make a choice which one to support. 
TeX, being a "glyph-based" system, loses out here in comparison to legacy 
plain-text character encoding systems such as the ISO/IEC 8859 series.
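A well-known case of round-trip-driven disunification, sketched in 
Python: MICRO SIGN (U+00B5) is encoded separately from GREEK SMALL LETTER 
MU (U+03BC) because ISO/IEC 8859-1 contains the micro sign, and Latin-1 
data must round-trip losslessly:

```python
import unicodedata

micro = "\u00b5"  # MICRO SIGN, inherited from ISO/IEC 8859-1
mu = "\u03bc"     # GREEK SMALL LETTER MU

# The micro sign round-trips through the legacy encoding...
assert micro.encode("iso-8859-1").decode("iso-8859-1") == micro

# ...while its compatibility decomposition records the relationship
# to the Greek letter.
print(unicodedata.decomposition(micro))            # <compat> 03BC
print(unicodedata.normalize("NFKC", micro) == mu)  # True
```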

Some unifications, among punctuation marks in particular, seem to have 
been made on a more ad-hoc basis. The issue is exacerbated by the fact 
that many such systems have neither the wide familiarity of standard 
writing systems (with their tolerance for glyph variation) nor the rigor 
of something like mathematical notation. This leads to the pragmatic 
choice of letting users select either "shape" or "concept" rather than 
"identity"; generally, such ad-hoc solutions should be resisted -- they 
are certainly not to be seen as a precedent for "encoding concepts".

But such exceptions prove the rule, which leads back to where we 
started: the default position is that Unicode encodes a character 
identity that is not the same as encoding the concept that said 
character is used to represent in writing.
