Encoding italic
Martin J. Dürst via Unicode
unicode at unicode.org
Thu Jan 17 00:27:21 CST 2019
On 2019/01/17 12:38, James Kass via Unicode wrote:
> ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf )
>
> "Plain text must contain enough information to permit the text to be
> rendered legibly, and nothing more."
>
> The argument is that italic information can be stripped yet still be
> read. A persuasive argument towards encoding would need to negate that;
> it would have to be shown that removing italic information results in a
> loss of meaning.
Well, yes. But please be aware of the fact that characters and text are
human inventions grown and developed in many cultures over many
centuries. It's not something where a single sentence will make all the
subsequent decisions easy.
So even if you can find examples where the presence or absence of
styling clearly makes a semantic difference, that may well not be
enough. Only when a styling difference often or overwhelmingly (as
opposed to occasionally) makes a semantic difference would this start
to become a real argument for plain text encoding of italics (or other
styling information).
To give a similar example, books about typography may discuss the
different shapes of 'a' and 'g' in various fonts (often, the roman
variant uses one shape (e.g. the 'g' with two circles), and the italic
uses the other (e.g. the 'g' with a hook towards the bottom right)). But
just because in this context, these shapes are semantically different,
doesn't mean that they need to be distinguished at the plain text level.
(There are variants for IPA that are restricted to specific shapes,
namely 'ɑ' and 'ɡ', but that's a separate issue.)
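For illustration, the IPA letters mentioned above are separate code
points from the ordinary 'a' and 'g'. The following is a minimal
sketch using Python's standard unicodedata module; the variable names
are just for this example.

```python
import unicodedata

# Ordinary Latin letters and the IPA-specific shapes mentioned above.
# Escapes are used so the example does not depend on the mail encoding.
letters = ["a", "\u0251", "g", "\u0261"]

# Map each character to its formal Unicode character name.
names = {ch: unicodedata.name(ch) for ch in letters}

for ch in letters:
    print(f"U+{ord(ch):04X}  {names[ch]}")
```

The two shape-restricted letters carry their own character names, while
the roman/italic shape variation of plain 'a' and 'g' stays a font
matter.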
> The decision makers at Unicode are familiar with italic use conventions
> such as those shown in "The Chicago Manual of Style" (first published in
> 1906). The question of plain-text italics has arisen before on this
> list and has been quickly dismissed.
>
> Unicode began with the idea of standardizing existing code pages for the
> exchange of computer text using a unique double-byte encoding rather
> than relying on code page switching. Latin was "grandfathered" into the
> standard. Nobody ever submitted a formal proposal for Basic Latin.
> There was no outreach to establish contact with the user community --
> the actual people who used the script as opposed to the "computer nerds"
> who grew up with ANSI limitations and subsequent ISO code pages. Because
> that's how Unicode rolled back then. Unicode did what it was supposed
> to do WRT Basic Latin.
I think most Unicode specialists have chosen to ignore this thread by
this point. In their defense, I would like to point out that among the
people who started Unicode, there were definitely many people who were
very familiar with styling needs. As a simple example, Apple was
interested in styled text from the very beginning. Others were
very familiar with electronic publishing systems. There were also
members from the library community, who had their own requirements and
character encoding standards. And many must have known TeX and other
kinds of typesetting and publishing software. GML and then SGML were
developed by IBM.
Based on these data points, and knowing many of the people involved, my
description would be that decisions about what to encode as characters
(plain text) and what to deal with on a higher layer (rich text) were
taken with a wide and deep background, in a gradually forming industry
consensus.
That doesn't mean that for all these decisions, explicit proposals were
made. But it means that even where these decisions were made implicitly
(at least on the level of the Consortium and the ISO/IEC and national
standards body committees), they were made with a full and rich
understanding of user needs and technology choices.
This led to the layering we have now: Case distinctions at the
character level, but style distinctions at the rich text level. Any good
technology has layers, and it makes a lot of sense to keep established
layers unless some serious problem is discovered. The fact that Twitter
(currently) doesn't allow styled text and that there is a small number
of people who (mis)use Math alphabets for writing italics etc. on Twitter
doesn't look like a serious problem to me.
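As a rough sketch of what that (mis)use looks like at the code point
level: the helper to_math_italic below is hypothetical, written only for
this illustration. It maps ASCII letters to the Mathematical Italic
letters and then shows that compatibility normalization (NFKC) folds
them back to ordinary letters, i.e. the "italic" is not preserved as
text by processes that normalize this way.

```python
import unicodedata

def to_math_italic(text):
    """Hypothetical helper: map ASCII letters to Mathematical Italic letters."""
    out = []
    for ch in text:
        if ch == "h":
            # Italic small h is not in the math block; U+210E PLANCK CONSTANT
            # fills that slot.
            out.append("\u210E")
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)  # spaces, digits, punctuation are left alone
    return "".join(out)

styled = to_math_italic("Jackie Brown")

# A plain substring search no longer finds the name in the styled string.
found_before = "Jackie" in styled

# NFKC applies the <font> compatibility mappings and restores plain letters.
plain = unicodedata.normalize("NFKC", styled)
found_after = "Jackie" in plain

print(found_before, found_after, plain)
```

So searching and matching break on the styled string until it is
normalized, and once it is normalized the styling is gone.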
> When someone points out that italics are used for disambiguation as well
> as stress, the replies are consistent.
>
> "That's not what plain-text is for." "That's not how plain-text
> works." "That's just styling and so should be done in rich-text."
> "Since we do that in rich-text already, there's no reason to provide for
> it in plain-text." "You can already hack it in plain-text by enclosing
> the string with slashes." And so it goes.
As such, these answers might indeed not look very convincing. But they
are given in the overall framework of text representation in today's
technology (see above). And please note that the end user doesn't ask
for "italics in plain text", they ask for "italics on Twitter" or some such.
If you ask for "italics in plain text", then to people understanding the
whole technology stack, that very much sounds as if you grew up with
ASCII and similar plain text limitations, continuing to be a computer
nerd who hasn't yet seen or understood rich text.
> But if variant letter form information is stripped from a string like
> "Jackie Brown", the primary indication that the string represents either
> a person's name or a Tarantino flick title is also stripped. "Thorstein
> Veblen" is either a dead economist or the name of a fictional yacht in
> the Travis McGee series. And so forth.
In probably around 99% or more of the cases, the semantic distinction
would be obvious from the context. Also, for probably at least 90% of
the readership, the style distinction alone wouldn't induce a semantic
distinction, because most of the readers are not familiar with these
conventions.
(If you doubt that, please go out on the street and ask people what
italics are used for, and count how many of them mention film titles or
ship names.)
(And just while we are at it, it would still not be clear which of
several potential people named "Jackie Brown" or "Thorstein Veblen"
would be meant.)
> Computer text tradition aside, nobody seems to offer any legitimate
> reason why such information isn't worthy of being preservable in
> plain-text. Perhaps there isn't one.
See above.
> I'm not qualified to assess the impact of italic Unicode inclusion on
> the rich-text world as mentioned by David Starner. Maybe another list
> member will offer additional insight or a second opinion.
I'd definitely second David Starner on this point. The more options one
has to represent one and the same thing (italic styling in this thread),
the more complex and error-prone the technology gets.
Regards, Martin.