Unicode encoding philosophy

Erik Carvalhal Miller ecm.unicode at gmail.com
Tue Oct 10 19:39:55 CDT 2023


The Unicode Standardʼs core specification, in chapter 23 (“Special
Areas and Format Characters”), §23.4 (“Variation Selectors”), is a
little vague about variation sequences, stating that in “special
circumstances” within “plain text contexts” they are used “for
specifying a restriction on the set of glyphs that are used to
represent a particular character”.  My take is that they are used in
situations where the base character and its selectable variant(s)
definitely represent the same abstract character identity yet, in some
contexts at least, donʼt share quite the same identity, a sort of
have‐your‐cake‐and‐eat‐it‐too.  For some such situations,
Unicode might designate distinct code points instead; it seems that,
when the UTC is making an encoding decision about the sort of
situation that might call for variation selectors, the ultimate choice
between variation selectors and distinct code points is influenced by
questions of compatibility and preëxisting encodings.  With the base
characters at issue in L2/23-212 already well established in Unicode,
it makes sense to consider variation sequences for distinguishing the
desired behaviors instead of assigning new code points.

It turns out that the Unicode Standard does formally and explicitly
declare a set of encoding principles, in the core specificationʼs
chapter 2 “General Structure”, §2.2 “Unicode Design Principles”.  One
of them in particular, Plain Text, would appear to be key in the
opposition to a scheme for folding italicization, a rich‐text feature,
into Unicodeʼs character‐encoding standard.  But, you may ask,
what about the mathematical Latin and Greek alphanumeric symbols, rich
in implied typographical styles including italic?

Thereʼs a temptation to consider Unicodeʼs acceptance of those pesky
symbols, which the Consortium has emphasized is not a precedent for
the inclusion of further plain‐text italics within the Standard, as an
exception to the rules; but actually I find the symbolsʼ encoding
quite consistent with overall Unicode logic.  I trust youʼre on board
with the notion of including them in the standard one way or another
on the basis of distinct semantics, that for example an italic
variable A can mean something distinct from a bold variable A.  Letʼs
consider an equation that youʼll probably recognize, font support
willing: 𝐸 = 𝑚𝑐².  Thanks to the power of Unicode, we could use it
in the same plain‐text document as, say, ℰ = 𝐦𝕔² while keeping both
equations distinct.  So, you may be thinking, thatʼs what you want
with a more generalized italics scheme, via variation selectors; after
all, in nontechnical text, styles such as italics convey meaning such
as “emphasis” or “title of a work” or “section heading” or “foreign
expression”.  But there is a fundamental difference!

In the math symbols, the stylization helps define a characterʼs
distinct identity, so that we donʼt mix up variables 𝐸 and ℰ; in more
general text usage, the styles donʼt change the character identity,
but rather the styles themselves convey meaning independently of any
characters appearing in the styled run.  This becomes more obvious
with styles such as outlining and background color where none of the
actual glyphs change and any spacing invisible characters (such as
U+0020 SPACE) are clearly part of the style run.  In normal italic
styling, yes, the visible charactersʼ glyphs do change, but they do so
because those characters are in the midst of an italic run with a
beginning and an end, not because a letter such as E in the midst of
such a run has a different identity from the one it has outside such a
run.  For the math symbols, choices such as italic or bold are
character‐by‐character decisions; for example, in 𝐸 = 𝑚𝑐², the 𝑚
and the 𝑐, though adjacent, are each independently italic — compare
with the commutatively equivalent 𝐸 = 𝑐²𝑚, where the superscript 2
remains upright — whereas in more general text usage, adjacent italics
such as in an italicized word “emcee” are not a character‐by‐character
decision but the result of a decision to italicize a whole span of
text.
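
To make the point about distinct identities concrete, here is a minimal Python sketch using only the standard unicodedata module.  It lists the character names behind the two equations above and shows that canonical normalization keeps the equations apart while compatibility normalization folds the styling away.  The variable and function names (eq_italic, eq_mixed, describe) are made up for illustration.

```python
import unicodedata

# The two equations from the text, written with escapes so the code
# points are explicit.
eq_italic = "\U0001D438 = \U0001D45A\U0001D450\u00B2"  # 𝐸 = 𝑚𝑐²
eq_mixed = "\u2130 = \U0001D426\U0001D554\u00B2"       # ℰ = 𝐦𝕔²


def describe(text):
    """Return (code point, character name) for each non-ASCII character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch))
            for ch in text if ord(ch) > 0x7F]


for label, eq in (("italic", eq_italic), ("mixed", eq_mixed)):
    for code_point, name in describe(eq):
        print(label, code_point, name)

# Canonical normalization (NFC) leaves the two equations distinct.
nfc_italic = unicodedata.normalize("NFC", eq_italic)
nfc_mixed = unicodedata.normalize("NFC", eq_mixed)
print(nfc_italic != nfc_mixed)

# Compatibility normalization (NFKC) discards the styling, and with it
# the distinction between the variables.
folded_italic = unicodedata.normalize("NFKC", eq_italic)
folded_mixed = unicodedata.normalize("NFKC", eq_mixed)
print(folded_italic, "|", folded_mixed)
```

Each styled letter is an atomic character with its own name, which fits the character‐by‐character nature of the choice in mathematical notation.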

And that brings us to another of Unicodeʼs design principles, Logical
Order.  For general text, italics span a run with a defined beginning
and end, such as in the HTML representation <i>annus mirabilis</i>; to
use a character‐by‐character representation such as
<i>a</i><i>n</i><i>n</i><i>u</i><i>s</i><i>m</i><i>i</i><i>r</i><i>a</i><i>b</i><i>i</i><i>l</i><i>i</i><i>s</i>
or &aital;&nital;&nital;&uital;&sital;&mital;&iital;&rital;&aital;&bital;&iital;&lital;&iital;&sital;
or a&VS14;n&VS14;n&VS14;u&VS14;s&VS14;m&VS14;i&VS14;r&VS14;a&VS14;b&VS14;i&VS14;l&VS14;i&VS14;s&VS14;
is counterintuitive and, I can now attest, tedious.  Such runs of
italics are inherently stateful in their conceptualization, and rich
text implements them statefully.  This statefulness applies even to
pre‐computer days of metal type:  A compositor about to set a run of
italic type would turn to a case of italics from which to pick out the
next several glyphs, then turn back to a non‐italic case when that run
was complete, rather than serially start and finish using the italic
case many times in a row.  To encode spans of italics
character‐by‐character, whether with variation sequences or atomic
characters, violates the logical order of general text.
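
The contrast between a run and a character‐by‐character encoding can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the use of U+FE0D VARIATION SELECTOR-14 as an "italic" selector is hypothetical (no such variation sequences are defined), the function names are invented, and leaving spaces unmarked is an arbitrary choice made here.

```python
# VARIATION SELECTOR-14; treating it as an italic marker is hypothetical.
VS14 = "\uFE0D"


def italic_run(text):
    """Stateful markup: one start and one end marker around the whole run."""
    return f"<i>{text}</i>"


def italic_per_character(text):
    """Per-character encoding: each non-space character gets its own selector.

    Spaces are left unmarked here, which is an assumption of this sketch.
    """
    return "".join(ch if ch.isspace() else ch + VS14 for ch in text)


phrase = "annus mirabilis"
run = italic_run(phrase)
per_char = italic_per_character(phrase)

# Compare the lengths in code points of the plain phrase and both encodings.
print(len(phrase), len(run), len(per_char))

# The run form states the decision once; the per-character form repeats
# it for every letter.
print(run.count("<i>"), per_char.count(VS14))
```

The run form records a single decision with a beginning and an end, whereas the per‐character form restates that decision once per letter and leaves open whether the space between the words belongs to the italic span.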

And then, besides Plain Text and Logical Order, thereʼs the Stability
principle, which comes into play with canonical equivalence when you
start to play with canonical composition and decomposition of the
various accented characters to which italics should be applicable (as
alluded to in the aforementioned chapter 23, §23.4).  But I urge you
to give those design principles a look.
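
One possible reading of the canonical‐equivalence concern can be sketched with the standard unicodedata module.  The sketch again assumes, hypothetically, that U+FE0D VARIATION SELECTOR-14 marks italics; the variable names are invented.  It shows that once an accented letter is decomposed, the selector no longer sits next to the base letter, and that moving the selector next to the base letter produces a string that is not canonically equivalent.

```python
import unicodedata

# Hypothetical italic selector, as in the per-character example in the text.
VS14 = "\uFE0D"

precomposed = "\u00E9" + VS14   # é, then the selector
decomposed = "e\u0301" + VS14   # e, COMBINING ACUTE ACCENT, then the selector

# Canonical decomposition puts the combining accent between the base
# letter and the selector.
nfd = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(ch):04X}" for ch in nfd])
print(nfd == decomposed)

# The selector has canonical combining class 0, so normalization never
# reorders it past the accent.
print(unicodedata.combining(VS14), unicodedata.combining("\u0301"))

# Placing the selector directly after the base letter gives a different
# string that normalization does not reconcile with the first two.
selector_first = "e" + VS14 + "\u0301"
nfc_selector_first = unicodedata.normalize("NFC", selector_first)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)
print(nfc_selector_first == nfc_precomposed)
```

So an italic selector would end up either attached to a precomposed letter or trailing a combining mark, depending on the normalization form in use.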

On Wed, Oct 4, 2023 at 1:59 PM William_J_G Overington via Unicode
<unicode at corp.unicode.org> wrote:
>
> I have been reading the following.
>
> https://www.unicode.org/L2/L2023/23212-quotes-svs-proposal.pdf
>
> I am not an expert on this at all. It looks good and I hope it becomes
> implemented.
>
> What puzzles me though, is that structurally the proposal seems to have
> much the same encoding philosophy as a suggestion proposed by me in that
> they both would allow a variation selector to be used so as to conserve
> in plain text information that is typically these days conserved in rich
> text and gets lost if plain text is used. In my proposal, using a
> variation selector to conserve in a plain text document information
> about the use of italics in some text.
>
> My proposal was rejected, quite strongly.
>
> So, deep down, what please is the Unicode encoding philosophy that
> allows variation selectors to be used to conserve some information, yet
> not other information, in plain text?
>
> William Overington
>
> Wednesday 4 October 2023
>


