Italics get used to express important semantic meaning, so Unicode should support them

Christian Kleineidam christian.kleineidam at gmail.com
Fri Dec 11 06:57:23 CST 2020


In the FAQ on Ligatures it's written "The mathematical letters and digits
are meant to be used only in mathematics, where the distinction between a
plain and a bold letter is fundamentally semantic rather than stylistic."
This suggests that the spirit of Unicode includes the intention to be able
to represent semantic meaning.

On Wikidata, we have the open problem of what to do with academic articles
that have italics in their official title. We store, for example, the paper
https://www.wikidata.org/wiki/Q33988883, which according to what the
publisher writes on http://www.biochemsoctrans.org/content/33/4/582 has
italics as part of its proper name. In Wikidata we want to be able to
store the semantic meaning.
This gives us the choice in Wikidata of listing the paper either as
"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
haplotype to Homo sapiens", which drops the italics, or as "Evidence
suggesting that 𝐻𝑜𝑚𝑜 𝑛𝑒𝑎𝑛𝑑𝑒𝑟𝑡ℎ𝑎𝑙𝑒𝑛𝑠𝑖𝑠 contributed the H2 𝑀𝐴𝑃𝑇
haplotype to 𝐻𝑜𝑚𝑜 𝑠𝑎𝑝𝑖𝑒𝑛𝑠", which uses the mathematical
characters against recommendations. The website itself lists the title as
"Evidence suggesting that Homo neanderthalensis contributed the H2 MAPT
haplotype to Homo sapiens", with the species and gene names set in
italics.
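
As a rough illustration of what the mathematical-letter workaround
involves, here is a minimal Python sketch. The function name
to_math_italic is made up for this example; the code point offsets are
those of the Mathematical Alphanumeric Symbols block, where the slot for
italic small h is reserved and U+210E PLANCK CONSTANT is used instead.

```python
# Sketch only: map ASCII letters onto Mathematical Italic code points.
# to_math_italic is a hypothetical helper name, not an existing API.
ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A
PLANCK_CONSTANT = "\u210E"  # used for italic small h; U+1D455 is reserved


def to_math_italic(text: str) -> str:
    """Replace ASCII letters with mathematical italic look-alikes."""
    out = []
    for ch in text:
        if ch == "h":
            out.append(PLANCK_CONSTANT)
        elif "A" <= ch <= "Z":
            out.append(chr(ITALIC_CAPITAL_A + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(ITALIC_SMALL_A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, spaces and punctuation stay as they are
    return "".join(out)


styled = to_math_italic("Homo neanderthalensis")
print(styled)
print([f"U+{ord(c):04X}" for c in styled[:4]])
```

The result only looks like italic text. Every letter is a different code
point from the plain letter it imitates, which is why this counts as
using the characters against recommendations.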

In scientific articles like that, the ability to represent italics is needed
to express all of the semantic meaning that's contained in the title. In
contrast to properties like font size, italics have come to be used in the
real world to express semantic meaning. For a project like Wikidata that
cares about storing the semantic meaning of the title of an academic paper,
that makes Unicode problematic, as it leads us to lose information.

You might say that if Unicode doesn't serve the needs of Wikidata to store
the semantic content of the texts we care about, we should add additional
formatting on top of Unicode. Between RTF, Markdown, SGML, HTML, XML and
Wikitext there are multiple different formats we could use on Wikidata to
potentially represent italics. If we chose any one of them, however, data
reusers who use another format would need to run a parser over our data,
which increases their code complexity and makes it harder to interact with
our data.

Official style guidelines like the Chicago Manual of Style (18th edition)
specify that certain italics should be used to express certain semantic
meaning:

22.1.3 Other Types of Names
Other types of names also follow specific patterns for capitalization, and
some require italics.

22.2.1 Foreign-Language Terms
Italicize isolated words and phrases in foreign languages likely to be
unfamiliar to readers of English, and capitalize them as in their language.

22.3.2.1 ITALICS. Italicize the titles of most longer works, including the
types listed here. An initial "the" should be roman and lowercase before
titles of periodicals, or when it is not considered part of the title. For
parts of these works and shorter works of the same type, see 22.3.2.2.

The inability to follow the recommendations of the Chicago Manual of Style
to express semantic meaning in italics means that Unicode fails in its
mission to be able to express all semantic distinctions. This means that
it's technically impossible to follow the Chicago Manual of Style in
source-code comments that are encoded in Unicode.

Outside of specialized needs like those of Wikidata and of programmers who
might want to follow the Chicago Manual of Style in their comments, the
inability of Unicode to represent italic and bold text makes life harder
for average users as well. Web browsers can't offer their users the ability
to format a part of the text as italic or bold. As a result many users
don't know how to italicize or bold text when they write online, as
different websites use different standards. Many online systems break
WYSIWYG for italics and bold, which makes it harder for non-technical users
to use them to express themselves.
If Unicode supported italics and bold, the browser could make it easy
for users to apply them. Even smartphones would have the option to let a
user italicize or bold a text in the menu that currently allows copying
and pasting.

Websites like https://yaytext.com/bold-italic/ get used by users to express
themselves in italics and bold on platforms like Facebook and Twitter that
use Unicode without additional formatting.

Having to use the unofficial workaround of mathematical letters is
undesirable because it means that software like screen readers is less
likely to interact well with the resulting text.
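
A small Python sketch of why the workaround interacts badly with other
software, using only the standard unicodedata module. The helper name
fold_styled_letters is made up for this example, and NFKC is used here as
one possible way to fold the letters back; note that NFKC also changes
other compatibility characters, so it is not a lossless round trip.

```python
import unicodedata

plain = "Homo sapiens"
# The same words spelled with Mathematical Italic code points.
styled = (
    "\U0001D43B\U0001D45C\U0001D45A\U0001D45C "
    "\U0001D460\U0001D44E\U0001D45D\U0001D456\U0001D452\U0001D45B\U0001D460"
)


def fold_styled_letters(text: str) -> str:
    """Hypothetical helper: fold compatibility letters back to plain ones."""
    return unicodedata.normalize("NFKC", text)


# A plain-text search does not find the styled words.
print("sapiens" in styled)

# The first character is a mathematical symbol, not a Latin letter.
print(unicodedata.name(styled[0]))

# Folding restores searchability but throws the italic distinction away.
print(fold_styled_letters(styled) == plain)
```

Software that wants to treat the styled text as ordinary words has to
fold it first, and after folding nothing records that the words were
meant to be italic.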

Proposal of a solution:

In today's usage italics often have semantic meaning. There are many cases
where it's desirable that a user can express such meaning but where there's
no intention to give the user control over features such as font size that
the user gets when HTML or RTF is used as format. With the symbol for
Right-to-Left text there's a precedent in Unicode for having signs that
manipulate multiple following characters. At the time of the design italics
weren't used for expressing fundamentally semantic meaning such as "Homo
neanderthalensis" referring to a species as it's used in the title of the
above paper.

Create new Unicode characters for begin/end italic formatting and
begin/end bold formatting that work like the Unicode character for the
Right-to-Left switch.
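
To make the proposal concrete, here is a minimal Python sketch of how a
data reuser might handle such characters. Everything named here is an
assumption for illustration: no such characters exist in Unicode, so two
Private Use Area code points (U+E000 and U+E001) stand in for the
proposed begin/end italic characters, and the function names are made up.
Bold would work the same way with a second pair.

```python
import html

# Hypothetical stand-ins from the Private Use Area. Unicode assigns no
# begin/end italic characters; these exist only for this sketch.
BEGIN_ITALIC = "\uE000"
END_ITALIC = "\uE001"


def strip_italic_markers(text: str) -> str:
    """Drop the markers for consumers that only want the plain letters."""
    return text.replace(BEGIN_ITALIC, "").replace(END_ITALIC, "")


def to_html(text: str) -> str:
    """Render the markers as <i> elements for consumers that want HTML."""
    escaped = html.escape(text)  # the marker characters pass through unchanged
    return escaped.replace(BEGIN_ITALIC, "<i>").replace(END_ITALIC, "</i>")


title = (
    f"Evidence suggesting that {BEGIN_ITALIC}Homo neanderthalensis{END_ITALIC} "
    f"contributed the H2 {BEGIN_ITALIC}MAPT{END_ITALIC} haplotype to "
    f"{BEGIN_ITALIC}Homo sapiens{END_ITALIC}"
)

print(strip_italic_markers(title))
print(to_html(title))
```

In this sketch the letters themselves stay ordinary Latin letters, so a
consumer that ignores italics needs two string replacements rather than a
format-specific parser.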