Unicode characters unification
Asmus Freytag via Unicode
unicode at unicode.org
Mon May 28 23:40:49 CDT 2018
In the discussion leading up to this it has been implied that Unicode
encodes / should encode concepts or pure shape. And there's been some
confusion as to where concerns about sorting or legacy encodings fit in.
Time to step back a bit:
Primarily the Unicode Standard encodes by character identity - something
that is different from either the pure shape or the "concept denoted by"
that character.
For example, for most alphabetic characters, you could say that they
stand for a more-or-less well-defined phonetic value. But Unicode does
not encode such values directly, instead it encodes letters - which in
turn get re-purposed for different sound values in each writing system.
Likewise, the various uses of period or comma are not separately encoded
- potentially these marks are given mappings to specific functions for
each writing system or notation using them.
Clearly these are not encoded to represent a single mapping to an
external concept, and, as we will see, they are not necessarily encoded
directly by shape.
Instead, the Unicode Standard encodes character identity; but there are
a number of principled and some ad-hoc deviations from a purist
implementation of that approach.
The first one is that of forcing a disunification by script. What
constitutes a script can be argued over, especially as they all seem to
have evolved from (or been created based on) predecessor scripts, so
there are always pairs of scripts that have a lot in common. While an
"Alpha" and an "A" do have much in common, it is best to recognize that
their membership in different scripts leads to important differences so
that it's not a stretch to say that they no longer share the same identity.
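This first principle is easy to see in the character database itself.
As an illustrative sketch (using Python's standard unicodedata module),
the Latin, Greek, and Cyrillic capital "A" shapes are three distinct
characters with three distinct lowercase mappings:

```python
import unicodedata

# Three visually near-identical capitals, one per script. Their code
# points and lowercase mappings differ, because script membership is
# part of a character's identity.
for ch in ["A", "\u0391", "\u0410"]:  # Latin, Greek, Cyrillic
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  lowercase: {ch.lower()!r}")
```

Had "A" and "Alpha" been unified, a single lowercasing (or sorting)
rule would have to serve both scripts at once.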
The next principled deviation is that of requiring case pairs to be
unique. Bicameral scripts (and some of the characters in them) acquired
their lowercase at different times, so the relations between uppercase
and lowercase differ across scripts, which also gives rise to some
exceptional cases inside certain scripts.
This is one of the reasons to disunify certain bicameral scripts. But
even inside scripts, there are case pairs that may share lowercase forms
or may share uppercase forms, but said forms are disunified to make the
pairs separate. The first two principles match users' expectations in
that case changes (largely) work as expected in plain text, and sorting
also (largely) matches user expectations by default.
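One well-known in-script exception of this kind can be checked directly
(again a Python sketch, relying only on the built-in case mappings):
Greek has two lowercase sigma forms sharing a single uppercase, while
an identical-looking Latin/Cyrillic pair cases independently per script.

```python
# Greek: two lowercase sigmas, one uppercase - an in-script exception.
assert "\u03c3".upper() == "\u03a3"   # sigma -> capital Sigma
assert "\u03c2".upper() == "\u03a3"   # final sigma -> capital Sigma
assert "\u03a3".lower() == "\u03c3"   # default lowercase is the non-final form

# Across scripts, look-alike case pairs are kept separate so that each
# script's own case relation holds.
assert "B".lower() == "b"             # Latin
assert "\u0412".lower() == "\u0432"   # Cyrillic Ve
```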
The third principle is to disunify characters based on line-breaking or
line-layout properties. Implicit in that is the idea that plain text,
and not markup, is the place to influence basic algorithms such as
line-breaking and bidi layout (hence two sets of Arabic-Indic digits).
One can argue with that decision, but the fact is, there are too many
places where text exists without the ability to apply markup to go
entirely without that support.
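The two sets of Arabic-Indic digits make this concrete: they differ in
their Bidi_Class property, so the choice of digit set influences bidi
layout in plain text with no markup involved. A quick check against the
character database (Python sketch, standard unicodedata module):

```python
import unicodedata

# ARABIC-INDIC DIGIT ONE (U+0661) vs. EXTENDED ARABIC-INDIC DIGIT ONE
# (U+06F1): same numeric value, different Bidi_Class, hence different
# behavior under the bidi algorithm.
print(unicodedata.bidirectional("\u0661"))  # AN (Arabic Number)
print(unicodedata.bidirectional("\u06F1"))  # EN (European Number)
```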
The fourth principle is that of differential variability of appearance.
For letters proper, their identity can be associated with a wide range
of appearances from sparse to fanciful glyphs. If an entire piece of
text (or even a given word) is set using a particular font style,
context will enable the reader to identify the underlying letter, even
if the shape is almost unrelated to the "archetypical shape" documented
in the Standard.
When letters or marks get re-used in notational systems, though, the
permissible range of variability changes dramatically: variations that
do not change the meaning of a word in styled text suddenly change the
meaning of text in a certain notational system. Hence the disunification
of certain letters or marks (but not all of them) in support of such
notations.
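Mathematical notation is the clearest case: a bold letter means
something different from a plain one, so the styled letters are
separately encoded, with compatibility normalization folding them back
to the base letter. An illustrative Python check:

```python
import unicodedata

# MATHEMATICAL BOLD CAPITAL A is a distinct character, because in
# mathematical notation the styling is meaning-bearing; NFKC recovers
# the underlying Latin letter.
bold_a = "\U0001D400"
assert unicodedata.name(bold_a) == "MATHEMATICAL BOLD CAPITAL A"
assert unicodedata.normalize("NFKC", bold_a) == "A"
```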
The fifth principle appears to be to disunify only as far as and only
when necessary. The biggest downside of this principle is that it leads
to "late" disunifications; some characters get disunified as the
committee becomes aware of some issue, leading to the problem of legacy
data. But it has usefully limited, to some extent, the further
proliferation of characters of identical appearance.
The final principle is compatibility. This covers being able to
round-trip from certain legacy encodings. This principle may force some
disunifications that otherwise might not have happened, but it also
isn't a panacea: there are legacy encodings that are mutually
incompatible, so that one needs to make a choice which one to support.
TeX, being a "glyph based" system, loses out here in comparison to
legacy plain-text character encoding systems such as the ISO/IEC 8859
series.
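The MICRO SIGN is a familiar example of this principle: it is encoded
separately from GREEK SMALL LETTER MU so that ISO/IEC 8859-1 data
round-trips byte-for-byte, with a compatibility mapping recording the
relationship. A Python sketch:

```python
import unicodedata

# Latin-1 byte 0xB5 round-trips through U+00B5 MICRO SIGN, a character
# that exists (alongside U+03BC mu) for compatibility with legacy
# encodings; NFKC maps it to the Greek letter.
micro = b"\xb5".decode("latin-1")
assert micro == "\u00b5"                                  # MICRO SIGN
assert micro.encode("latin-1") == b"\xb5"                 # round-trip preserved
assert unicodedata.normalize("NFKC", micro) == "\u03bc"   # compat. mapping to mu
```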
Some unifications among punctuation marks in particular seem to have
been made on a more ad-hoc basis. This issue is exacerbated by the fact
that many such systems lack both the wide familiarity of standard
writing systems (with their tolerance for glyph variation) and the rigor
of something like mathematical notation. This leads to the pragmatic
choice of letting users select either "shape" or "concept" rather than
"identity"; generally, such ad-hoc solutions should be resisted -- they
are certainly not to be seen as a precedent for "encoding concepts".
But such exceptions prove the rule, which leads back to where we
started: the default position is that Unicode encodes a character
identity that is not the same as encoding the concept that said
character is used to represent in writing.