“plain text styling”…

Wed Jan 4 18:53:40 CST 2023

More or less regularly there are (informal) requests on this list for encoding (new) control codes or control code sequences for text styling (like bold, italics, text colour, …) also for “plain text”. This would be instead of using such things as RTF, SGML, HTML, ODF, etc. In the latter, the style (and other) controls are given as strings of printable characters (like <b>, </b>), not involving control characters. An important aspect of the (informal) proposals that pop up is that the formatting encoding is more lightweight, but also less powerful, than HTML/CSS, RTF, ODF, etc.

Not surprisingly, this is not a new idea. It has popped up several times during (computer) history, going back at least 50 years, probably more.

The advantage of this approach is that, by using a separate class of characters for the controls, no substring of printable characters (including SP and HT) can be confused with controls for text styling.
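To make that point concrete, here is a minimal Python sketch. It assumes the ECMA-48 SGR sequences for bold on/off (ESC [ 1 m and ESC [ 22 m) as the example control codes, and the function names are just made up for illustration. With printable markup, a literal "<b>" in the text has to be escaped so it is not taken for a tag; with control codes, the printable text can be stored unchanged.

```python
import html

ESC = "\x1b"              # the C0 control character ESCAPE
BOLD_ON = ESC + "[1m"     # SGR 1: bold (assumed example control)
BOLD_OFF = ESC + "[22m"   # SGR 22: normal intensity

def markup_bold(text: str) -> str:
    """Bold via printable markup: the text must be escaped first."""
    return "<b>" + html.escape(text) + "</b>"

def control_bold(text: str) -> str:
    """Bold via control codes: the printable text is left untouched."""
    return BOLD_ON + text + BOLD_OFF

literal = "Use <b> for bold"

as_markup = markup_bold(literal)     # '<b>Use &lt;b&gt; for bold</b>'
as_controls = control_bold(literal)  # ESC[1m + literal + ESC[22m
```

In the markup version the original string no longer appears verbatim in the stored form; in the control code version it does.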

As I've mentioned long before, there is no need to reinvent that approach (unless you really, really want to...). There are two such schemes that are still “alive”, one of which is given by a standard referenced by the Unicode/10646 standard(s). One is Teletext (yes, it is still alive, though very much on the decline), which is now based on the ETSI EN 300 706 standard and is also available over DVB. Teletext is not a standard referenced by Unicode. The other, referenced by Unicode, is ECMA-48 (a.k.a. ISO/IEC 6429, but I can never memorise that number).

Further, the text in Teletext is embedded in a rather complex out-of-line (i.e., outside of the text), but not very powerful, protocol. Even line breaks are in the out-of-line protocol, not inline, i.e., there are no CR, LF or other line break characters (and there is in reality no ESC character either). Though some of the text styling is inline, some is not (such as additional colours, and requesting a proportional font) and is instead given out of line in the protocol. (Nit: Teletext relies very much on code page switching, but **none** of that switching is done by escape sequences; there is not even any (real) ESC character, let alone any escape sequences.)

ECMA-48, however, is fully inline with the text. It is mostly still alive via terminal emulators, and for terminal emulators ECMA-48 will continue to be used for the foreseeable future. But the text styling part of ECMA-48 (plus proposed extensions/updates) can very well also be used as a text file storage format, allowing styled text documents to represent the styling via ECMA-48 control sequences. That it is “old” does not matter. Other parts of ECMA-48 concern things like cursor movement, window scrolling, and even (terminal emulator window) erase controls, and those “non-styling” controls are not suitable for styled text file storage. Unfortunately, the ECMA-48 text does not make such a clear distinction; it has to be inferred.
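As a rough illustration of that inferred distinction, here is a small Python sketch. It builds SGR (SELECT GRAPHIC RENDITION) control sequences and filters a string so that only SGR sequences survive, dropping cursor movement and erase controls. Treating SGR alone as “the styling part” is a simplification made for this sketch, and the names sgr and keep_styling_only are hypothetical helpers, not anything defined by ECMA-48.

```python
import re

ESC = "\x1b"
CSI = ESC + "["   # 7-bit form of CONTROL SEQUENCE INTRODUCER

def sgr(*params: int) -> str:
    """Build an SGR control sequence, e.g. sgr(3, 31) -> ESC [ 3 ; 31 m."""
    return CSI + ";".join(str(p) for p in params) + "m"

# General shape of a control sequence: CSI, parameter bytes (0x30-0x3F),
# intermediate bytes (0x20-0x2F), and one final byte (0x40-0x7E).
CONTROL_SEQUENCE = re.compile(r"\x1b\[([0-?]*)([ -/]*)([@-~])")
PLAIN_PARAMS = re.compile(r"[0-9;:]*")

def keep_styling_only(text: str) -> str:
    """Drop every control sequence that is not SGR (assumption: SGR is
    final byte 'm', no intermediate bytes, no private parameters)."""
    def keep_or_drop(match: re.Match) -> str:
        params, intermediates, final = match.groups()
        is_sgr = (final == "m" and intermediates == ""
                  and PLAIN_PARAMS.fullmatch(params) is not None)
        return match.group(0) if is_sgr else ""
    return CONTROL_SEQUENCE.sub(keep_or_drop, text)

# Styled text as it might be stored in a file: bold, then red italic.
styled = ("plain " + sgr(1) + "bold" + sgr(22) + " "
          + sgr(3, 31) + "red italic" + sgr(0))

# Terminal-style output: erase display, cursor home, text, cursor up.
terminal_output = CSI + "2J" + CSI + "H" + styled + CSI + "1A"

stored = keep_styling_only(terminal_output)
```

Here stored ends up equal to styled: the erase and cursor controls are gone, and the styling controls remain inline with the text.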

You may think that ECMA-48 is old-fashioned. And, yes, it was produced quite some time ago. But so was SGML, the origin of HTML and XML. The styling commands in ECMA-48 are a bit cryptic, but they are also comparatively compact, especially in comparison to HTML/CSS.

ECMA-48 styling controls may not originally have been intended as a storage format (though nothing in ECMA-48 prevents that), but rather as an output format (cf. the ‘man’ command in Unix/Linux, where the input/storage is an nroff typesetting command file, and the output uses ECMA-48 styling). But now, with text editors where style is edited by (selecting a substring and) using a menu selection or a keyboard shortcut, there is no technical reason why ECMA-48 styling cannot be used as a styled text file storage format.

It has, however, been a while since the last update to ECMA-48, and that shows. I’ve compiled a proposed update for the text styling part of ECMA-48: https://github.com/kent-karlsson/control/blob/main/ecma-48-style-modernisation-2022.pdf
As mentioned, using control codes for encoding styling is not an idea unique to ECMA-48. Two others are ISCII and Teletext. They do have some peculiarities in their formatting controls. However, by making certain additions to the formatting controls in ECMA-48, these peculiarities can be covered. Conversions from ISCII and Teletext are hinted at in tables in the paper referenced above. A few more hinted mappings (for PETSCII, ATASCII, EBCDIC, and various ISO registered sets of control codes) are given in https://www.unicode.org/L2/L2022/22013r-c0-c1-stability.pdf.

So ECMA-48 (esp. with updates as in the above reference) is a possible (relatively) lightweight text formatting alternative, in between “pure plain text” and “powerful text formatting” (like MS Word, ODF, HTML/CSS). And it fits in contexts that are otherwise “plain text”.

ECMA-48 (with some additions) also enables proper conversion of older control code sets, some of which include text styling, to modern character coding sets, without losing or mistreating the “old” control codes.

/Kent Karlsson

PS

Using ECMA-48 styling is fully compatible with the math expression representation (C1 alternative), which is a completely separate proposal, that I sent an email about last month.

PPS

ECMA-48 is a bit like Unicode/10646, in that it is a bit of a smorgasbord. Implementors may support the parts they decide to support. Implementations need not support everything specified.
