Encoding italic

Thu Jan 17 00:27:21 CST 2019

On 2019/01/17 12:38, James Kass via Unicode wrote:

> ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf )
> 
> "Plain text must contain enough information to permit the text to be 
> rendered legibly, and nothing more."
> 
> The argument is that italic information can be stripped yet still be 
> read.  A persuasive argument towards encoding would need to negate that; 
> it would have to be shown that removing italic information results in a 
> loss of meaning.

Well, yes. But please be aware of the fact that characters and text are 
human inventions grown and developed in many cultures over many 
centuries. It's not something where a single sentence will make all the 
subsequent decisions easy.

So even if you can find examples where the presence or absence of 
styling clearly makes a semantic difference, that may well not be 
enough. Only when a styling difference often or overwhelmingly (as 
opposed to occasionally) makes a semantic difference would this start 
to become a real argument for plain text encoding of italics (or other 
styling information).

To give a similar example, books about typography may discuss the 
different shapes of 'a' and 'g' in various fonts (often, the roman 
variant uses one shape (e.g. the 'g' with two circles), and the italic 
uses the other (e.g. the 'g' with a hook towards the bottom right)). But 
just because in this context, these shapes are semantically different, 
doesn't mean that they need to be distinguished at the plain text level.
(There are variants for IPA that are restricted to specific shapes, 
namely 'ɑ' and 'ɡ', but that's a separate issue.)

> The decision makers at Unicode are familiar with italic use conventions 
> such as those shown in "The Chicago Manual of Style" (first published in 
> 1906).  The question of plain-text italics has arisen before on this 
> list and has been quickly dismissed.
> 
> Unicode began with the idea of standardizing existing code pages for the 
> exchange of computer text using a unique double-byte encoding rather 
> than relying on code page switching.  Latin was "grandfathered" into the 
> standard.  Nobody ever submitted a formal proposal for Basic Latin. 
> There was no outreach to establish contact with the user community -- 
> the actual people who used the script as opposed to the "computer nerds" 
> who grew up with ANSI limitations and subsequent ISO code pages. Because 
> that's how Unicode rolled back then.  Unicode did what it was supposed 
> to do WRT Basic Latin.

I think most Unicode specialists have chosen to ignore this thread by 
this point. In their defense, I would like to point out that among the 
people who started Unicode, there were definitely many people who were 
very familiar with styling needs. As a simple example, Apple was 
interested in styled text from the very beginning. Others were 
very familiar with electronic publishing systems. There were also 
members from the library community, who had their own requirements and 
character encoding standards. And many must have known TeX and other 
kinds of typesetting and publishing software. GML was developed at IBM, 
and SGML grew out of it.

Based on these data points, and knowing many of the people involved, my 
description would be that decisions about what to encode as characters 
(plain text) and what to deal with on a higher layer (rich text) were 
taken with a wide and deep background, in a gradually forming industry 
consensus.

That doesn't mean that for all these decisions, explicit proposals were 
made. But it means that even where these decisions were made implicitly 
(at least on the level of the Consortium and the ISO/IEC and national 
standards body committees), they were made with a full and rich 
understanding of user needs and technology choices.

This led to the layering we have now: Case distinctions at the 
character level, but style distinctions at the rich text level. Any good 
technology has layers, and it makes a lot of sense to keep established 
layers unless some serious problem is discovered. The fact that Twitter 
(currently) doesn't allow styled text, and that a small number of people 
(mis)use Math alphabets to write italics and the like on Twitter, 
doesn't look like a serious problem to me.
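
To make the Math alphabet point concrete, here is a minimal Python 
sketch. It is an illustration only, not something taken from the 
standard or from Twitter. The helper name fake_italic is hypothetical; 
it maps ASCII letters to the Mathematical Italic letters (capitals from 
U+1D434, small letters from U+1D44E, with U+210E PLANCK CONSTANT 
standing in for the reserved slot of italic 'h'). It suggests why this 
counts as misuse: the "italic" string no longer matches the plain 
string in a search, and compatibility normalization (NFKC) folds the 
styling away again.

```python
import unicodedata


def fake_italic(text: str) -> str:
    """Hypothetical helper: swap ASCII letters for Mathematical Italic ones."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            # MATHEMATICAL ITALIC CAPITAL A starts at U+1D434.
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif ch == "h":
            # U+1D455 is reserved; italic 'h' is U+210E PLANCK CONSTANT.
            out.append("\u210E")
        elif "a" <= ch <= "z":
            # MATHEMATICAL ITALIC SMALL A starts at U+1D44E.
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            # Digits, spaces and punctuation are left untouched.
            out.append(ch)
    return "".join(out)


plain = "Thorstein Veblen"
styled = fake_italic(plain)

# A plain-text search for the name no longer finds the styled version.
print(plain in styled)

# NFKC maps the math letters back to their ordinary counterparts.
print(unicodedata.normalize("NFKC", styled) == plain)
```

Any process that applies NFKC (as many search and identifier pipelines 
do) would thus drop this kind of "italic" without notice, which is one 
more reason to keep styling on the rich text layer.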

> When someone points out that italics are used for disambiguation as well 
> as stress, the replies are consistent.
> 
> "That's not what plain-text is for."  "That's not how plain-text 
> works."  "That's just styling and so should be done in rich-text." 
> "Since we do that in rich-text already, there's no reason to provide for 
> it in plain-text."  "You can already hack it in plain-text by enclosing 
> the string with slashes."  And so it goes.

As such, these answers might indeed not look very convincing. But they 
are given in the overall framework of text representation in today's 
technology (see above). And please note that end users don't ask 
for "italics in plain text"; they ask for "italics on Twitter" or some such.

If you ask for "italics in plain text", then to people understanding the 
whole technology stack, that very much sounds as if you grew up with 
ASCII and similar plain text limitations, continuing to be a computer 
nerd who hasn't yet seen or understood rich text.

> But if variant letter form information is stripped from a string like 
> "Jackie Brown", the primary indication that the string represents either 
> a person's name or a Tarantino flick title is also stripped.  "Thorstein 
> Veblen" is either a dead economist or the name of a fictional yacht in 
> the Travis McGee series.  And so forth.

In probably around 99% or more of the cases, the semantic distinction 
would be obvious from the context. Also, for probably at least 90% of 
the readership, the style distinction alone wouldn't induce a semantic 
distinction, because most of the readers are not familiar with these 
conventions.

(If you doubt that, please go out on the street and ask people what 
italics are used for, and count how many of them mention film titles or 
ship names.)

(And just while we are at it, it would still not be clear which of 
several potential people named "Jackie Brown" or "Thorstein Veblen" 
would be meant.)

> Computer text tradition aside, nobody seems to offer any legitimate 
> reason why such information isn't worthy of being preservable in 
> plain-text.  Perhaps there isn't one.

See above.

> I'm not qualified to assess the impact of italic Unicode inclusion on 
> the rich-text world as mentioned by David Starner.  Maybe another list 
> member will offer additional insight or a second opinion.

I'd definitely second David Starner on this point. The more options one 
has to represent one and the same thing (italic styling in this thread), 
the more complex and error-prone the technology gets.

Regards,    Martin.