Encoding italic

Doug Ewell via Unicode unicode at unicode.org
Sun Feb 10 16:49:54 CST 2019


Egmont Koblinger wrote:

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.

As others have pointed out, I am suggesting the use of some profile of ISO 6429 within plain text to implement these features, about which there is disagreement as to whether they belong in plain text.

I am very definitely NOT proposing that anything be added to Unicode or 10646, nor that an all-new standard be created.

> There is not a well-defined framework for escape sequences.

I thought ISO 6429 defined things rather clearly, if verbosely.

> In this particular case you might say it starts with ESC [ and ends
> with the letter 'm', but how do you know where to end the sequence if
> that letter 'm' just doesn't arrive?

Well, what do you do in HTML if the closing '>' never arrives?

If it's simply a matter of the text coming to an end before the 'm' arrives, then it doesn't matter. If the 'm' (or other final code unit for other commands) is dropped but the sequence goes on, like <ESC>[3This is italicized<ESC>[m, then gosh, I don't know offhand what the standard says. It might be worthwhile to try looking it up, or seeing what implementations do, or defining it clearly in the profile.

> Terminal emulators have extremely complex tables for parsing (and
> still many of them get plenty of things wrong). It's unreasonable for
> any random small utility processing Unicode text to go into this
> business of recognizing all the well-known escape sequences, not even
> to the extent to know where they end.

Perhaps interestingly, I wrote a random small utility many years ago that displayed ISO 6429 sequences on a Windows console, back in the dark ages between ANSI.SYS and Windows 10 support for 6429. It didn't cover the entire standard, nor could it, but it covered a decent subset. It understood where sequences ended, even unknown ones, because that is all laid out in the standard.
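
To illustrate how a sequence boundary can be found without knowing what the sequence means, here is a minimal sketch in Python. It assumes the ECMA-48 control sequence layout: CSI, then any number of parameter bytes (0x30 to 0x3F), then any number of intermediate bytes (0x20 to 0x2F), then exactly one final byte (0x40 to 0x7E). The name `csi_end` is invented for this example; it is not the utility described above.

```python
ESC = "\x1b"
CSI_8BIT = "\x9b"  # single-character C1 form of CSI

def csi_end(text, start):
    """Return the index just past the control sequence that begins at
    `start`, or None if no complete, well-formed sequence begins there."""
    if text.startswith(ESC + "[", start):
        i = start + 2
    elif text.startswith(CSI_8BIT, start):
        i = start + 1
    else:
        return None
    n = len(text)
    # Parameter bytes: digits and : ; < = > ?
    while i < n and "\x30" <= text[i] <= "\x3f":
        i += 1
    # Intermediate bytes: space through /
    while i < n and "\x20" <= text[i] <= "\x2f":
        i += 1
    # Exactly one final byte closes the sequence.
    if i < n and "\x40" <= text[i] <= "\x7e":
        return i + 1
    return None  # text ended, or an unexpected character appeared
```

Under this reading, an unknown sequence such as `<ESC>[?25h` still has a well-defined end. In the dropped-'m' case `<ESC>[3This`, the 'T' itself falls in the final-byte range, so the sequence would end there; what an implementation should then do with it is a separate question that a profile could settle.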

> Whatever is designed should be much more easily parseable. Should you
> say "everything from ESC[ to m", you'll cause a whole bunch of
> problems when a different kind of escape sequence gets interpreted as
> Unicode.

I'm afraid I don't understand this statement.

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each.

That's easy:

3 = turn on italics
0 = turn off all special styling, including italics
1 = turn on bold (or intense, whichever the output device supports)

It's a silly sequence, because why would you turn on an attribute and then immediately turn it off before using it? But silly though it may be, it's well-formed and very easy to parse. My random small utility had no problem with it.
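
A minimal sketch of that parse in Python, assuming only the three parameters listed above are of interest and that an empty parameter defaults to 0. The function name `apply_sgr` and the two-flag state dictionary are invented for illustration.

```python
def apply_sgr(params, state=None):
    """Apply a semicolon-separated SGR parameter string, left to right,
    to a copy of `state` and return the resulting state."""
    state = dict(state or {"bold": False, "italic": False})
    for p in params.split(";"):
        if p == "":
            code = 0                      # empty parameter: default value 0
        elif p.isascii() and p.isdigit():
            code = int(p)
        else:
            continue                      # not a plain number: skip it
        if code == 0:
            state = {"bold": False, "italic": False}  # reset all styling
        elif code == 1:
            state["bold"] = True          # bold or increased intensity
        elif code == 3:
            state["italic"] = True
        # every other code is ignored in this sketch
    return state

print(apply_sgr("3;0;1"))  # italic is switched on, then cleared by the 0
```

Each parameter is handled exactly as if it had arrived in its own sequence, which is why the combined form adds little parsing burden.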

> Also, it should be carefully evaluated what to do with C1 (U+009B)
> instead of the C0 ESC[ opening for an escape sequence – here terminal
> emulators vary. These just make everything even more cumbersome.

Why would they vary? CSI encoded as <1B 5B> or as <9B> is exactly the same. Again, this is very clear in the standard.
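
As a small sketch of that equivalence, the following Python function folds the single-character form into the two-character form before any further processing. It works on decoded text (code points), so it assumes U+009B rather than a raw 0x9B byte; in UTF-8 that character is encoded as the two bytes C2 9B. The name `to_7bit_csi` is invented for this example, and the plain string replacement ignores edge cases such as a CSI character appearing inside some other control string.

```python
def to_7bit_csi(text):
    """Rewrite every single-character CSI (U+009B) as ESC [ so that
    later code only has to recognize one spelling."""
    return text.replace("\x9b", "\x1b[")

# Both spellings of "italic on" normalize to the same string.
same = to_7bit_csi("\x9b3m") == to_7bit_csi("\x1b[3m")
print(same)
```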

> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".
> It's only nowadays that most terminal emulators support 256 colors and
> some even support 16M true colors that some emulators try to push for
> this bit unambiguously meaning "bold" only, whereas in most emulators
> it means "both bold and increased intensity". [...]

Why would we expect every displayed and printed page to look identical? That's not going to happen no matter what encoding mechanism you use for "bold" and "intense" and the rest. Not all HTML pages look identical either.

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette?

Why not?

> Should Unicode go into the business

Nope. Unicode should do nothing about this.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.
> ECMA-48 doesn't say anything about it, ITU T.416 does, although it's
> absolutely not clear. See e.g. the discussion at the comment section
> of https://gist.github.com/XVilka/8346728 , in Dec 2018, we just
> couldn't figure out which syntax exactly ITU T.416 wants to say.

That sounds like someone should send a question to ITU-T. Exegesis would
perhaps be more productive than despair.

> Moreover, due to a common misinterpretation of the spec, one of the
> positional parameters are often omitted.

That's a decision designers and implementers are sometimes faced with: should we remain bug-compatible with other implementations, or follow the straight and narrow path? I remember browsers going through that era too.

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them?

Should we be extension-compatible with other implementations, or follow the straight and narrow path? Another decision that is not unique to ISO 6429.

> Where to draw the line what to add to Unicode and what not to? Will
> Unicode possibly be a bottleneck of further improvements in terminal
> emulators, because from now on every new mode we figure out we'd like
> to have in terminals should go through some Unicode committee?

I think you know the answer to this by now.

>> This mechanism [...] is already supported
>> as widely as any new Unicode-only convention will ever be.
>
> I truly doubt this, these escape sequences are specific to terminal
> emulation, an extremely narrow subset of where Unicode is used and
> rich text might be desired.

That's true. Probably next to nobody is using ISO 6429 sequences for plain text intended for print, just as next to nobody is using the proposed VS14 mechanism or Andrew West's Plane 14 mechanism. My suggestion was to document the ISO 6429 approach, run it up the flagpole, and see if anyone salutes.

> Or, if it wants to adopt some already existing technology, I find
> HTML/CSS a much better starting point.

Q: How can we represent italics in plain text?
A: Use rich text.


Kent Karlsson wrote:

>> • Underline on: ESC [4m
>    (implies turning double underline off)
>   Underline, double: ESC [21m
>    (implies turning single underline off)

I deliberately left out single and double underlining, and many other features of ISO 6429 SGR (such as Fraktur). The email was not intended as a final proposal. I do think it would be strange for single and double underlining not to cancel each other out.

> Note that these do NOT nest (no stack...), just state changes for the
> relevant PART of the "graphic" (i.e. style) state. So the approach in
> that regard is quite different from the approach done in HTML/CSS.

I don't regard that as either a bug or a feature. I certainly don't expect that every such mechanism has to nest, simply because SGML and its descendants are designed that way.
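
The state-change model can be sketched in a few lines of Python. This assumes the ECMA-48 SGR values 3 and 23 (italic on and off), 4 and 21 (single and double underline), 24 (not underlined), and 0 (reset); the function name `apply_codes` and the shape of the state dictionary are invented for illustration.

```python
def apply_codes(codes):
    """Apply SGR codes in order. Each code overwrites one part of the
    state; nothing is pushed or popped, so there is no nesting."""
    state = {"italic": False, "underline": "none"}
    for code in codes:
        if code == 0:
            state = {"italic": False, "underline": "none"}
        elif code == 3:
            state["italic"] = True
        elif code == 23:
            state["italic"] = False
        elif code == 4:
            state["underline"] = "single"   # replaces double, if set
        elif code == 21:
            state["underline"] = "double"   # replaces single, if set
        elif code == 24:
            state["underline"] = "none"
    return state

# Italic is switched off while underline stays on: no closing order applies.
print(apply_codes([3, 4, 23]))
```

Because underline is a single slot in the state, single and double underlining cancel each other out automatically, and attributes can be switched off in any order.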


--
Doug Ewell | Thornton, CO, US | ewellic.org




