Encoding italic (was: A last missing link)

Kent Karlsson via Unicode unicode at unicode.org
Tue Jan 22 17:26:09 CST 2019


Ok. One thing to note is that escape sequences (including control sequences,
for those who care to distinguish those) probably should be "default
ignorable" for display. Requiring, or even recommending, them to be default
ignorable for other processing (like sorting, searching, and other things)
may be a tall order. So, for display, (maximal) substrings that match:

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

should be default ignorable (i.e. invisible, though a "show invisibles"
mode would show them). Sequences that are not interpreted should be kept;
interpreted ones need not be kept as such, just (re)generated on save.
That is as far as Unicode should go.
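
As a rough illustration of the pattern above, here is a minimal Python
sketch. The names strip_for_display and show_invisibles are made up for
this example, and the "<ESC>"/"<CSI>" placeholders are just one possible
"show invisibles" rendering. Note that the control sequence alternative is
tried first: Python alternation is ordered rather than longest-match, and
ESC followed by "[" would otherwise match as a complete two-character
escape sequence.

```python
import re

# Escape sequences and control sequences in the ISO 6429 style.
# The control sequence branch comes first so that the maximal match wins.
ESCAPE_OR_CONTROL_SEQUENCE = re.compile(
    r"(?:\x1B\[|\x9B)[\x30-\x3F]*[\x20-\x2F]*[\x40-\x7E]"  # control sequence
    r"|\x1B[\x20-\x2F]*[\x30-\x7E]"                        # other escape sequence
)


def strip_for_display(text: str) -> str:
    """Treat matching substrings as default ignorable: drop them for display."""
    return ESCAPE_OR_CONTROL_SEQUENCE.sub("", text)


def show_invisibles(text: str) -> str:
    """Make the introducer visible instead of hiding the whole sequence."""
    def visible(match: re.Match) -> str:
        return match.group().replace("\x1b", "<ESC>").replace("\x9b", "<CSI>")
    return ESCAPE_OR_CONTROL_SEQUENCE.sub(visible, text)


if __name__ == "__main__":
    sample = "\x1b[3mbla bla bla\x1b[0m"
    print(strip_for_display(sample))
    print(show_invisibles(sample))
```

This only affects what is displayed; the stored text is untouched, so
sequences that are not interpreted are still there on save.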

Some may be interpreted; this thread focuses on italic, but also bold
and underlined. There is a whole bunch of "style" control sequences
(those that have "m" at the end of the sequence) specified, and terminal
emulators implement several of them, but not all.

As for editing, if "style" control sequences à la ISO 6429 were to be
supported in text editors, I would NOT expect users to type in those
escape/control sequences in any way, but use "ctrl/command-i" (etc.) or
menu commands as editors do now, and the representation as esc-sequences
be kept under wraps (and maybe only present in files, not in the internal
representation during editing), and not seen unless one starts to analyse
the byte sequences in files. So, even if you don't like this esc-sequence
business:
1) It would not be seen by most users, mostly by programmers (the same
goes for other ways of representing this, be it HTML, .doc, or whatever).
2) It is already standardised, and one can make a (slightly inaccurate)
argument that it is "plain text".

What one would need to do is:
1) Prioritise which "style" control sequences should be interpreted
(rather than be ignored).
2) Lobby "plain" text editor makers to support those styles,
representing them (in files) as standard control sequences.

A selection of already standardised style codes (i.e., for control
sequences that end in "m"):
 
0       default rendition (implementation-defined)

1       bold
(2      lean)
22      normal intensity (neither bold nor lean)

3       italicized
23      not italicized (i.e. upright)

4       singly underlined
(21     doubly underlined)
24      not underlined (neither singly nor doubly)

(9      crossed-out (strikethrough))
(29     not crossed out)

If you really want to go for colour as well (RGB values in 0-255)
(colour is popular in terminal emulators...):
 
(30-37  foreground: black, red, green, yellow, blue, magenta, cyan, white)
38      foreground colour as RGB. Next arguments 2;r;g;b
39      default foreground colour (implementation-defined)

(40-47  background: black, red, green, yellow, blue, magenta, cyan, white)
48      background colour as RGB. Next arguments 2;r;g;b
49      default background colour (implementation-defined)
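
To make the codes above concrete, here is a small Python sketch of how an
editor might generate these control sequences on save. The helper names
(sgr, styled, foreground_rgb) and the STYLES table are invented for this
example; the codes and the "2;r;g;b" argument form are the ones listed
above. Each style change is written as its own control sequence.

```python
ESC = "\x1b"

# (code to turn the style on, code to turn it off), from the list above.
STYLES = {
    "bold": (1, 22),
    "italic": (3, 23),
    "underline": (4, 24),
}


def sgr(*params: int) -> str:
    """Build one "style" control sequence: ESC [ parameters m."""
    return f"{ESC}[{';'.join(str(p) for p in params)}m"


def styled(text: str, style: str) -> str:
    """Wrap text in the on/off control sequences for one style."""
    on, off = STYLES[style]
    return sgr(on) + text + sgr(off)


def foreground_rgb(text: str, r: int, g: int, b: int) -> str:
    """Set the foreground colour as RGB, then return to the default."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError("RGB values must be in 0-255")
    return sgr(38, 2, r, g, b) + text + sgr(39)


if __name__ == "__main__":
    print(repr(styled("bla bla bla", "italic")))
    print(repr(foreground_rgb("red-ish", 255, 0, 10)))
```

Turning a style off with its specific code (23 for italic, and so on)
rather than with 0 leaves any other active style alone.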

There are some more (including some that assume a small font palette, for
changing font). But that is far enough for now. Maybe too far already. But
do not allow interpreting multiple style attribute codes in one control
sequence; that is quite unnecessary.
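
For the reading side, a minimal Python sketch of an interpreter that
accepts exactly one style code per control sequence might look as follows.
The function name styled_runs and the TOGGLES table are made up for this
example, and treating code 0 (or an empty parameter) as "clear everything"
is an assumption, since the default rendition is implementation-defined.
For brevity the sketch silently skips "m" sequences it does not interpret;
a real editor would keep them for saving.

```python
import re

# Control sequences that end in "m", with only parameter bytes before it.
SGR = re.compile(r"(?:\x1B\[|\x9B)([\x30-\x3F]*)m")

# One code per control sequence: parameter string -> (style, switched on?).
TOGGLES = {
    "1": ("bold", True), "22": ("bold", False),
    "3": ("italic", True), "23": ("italic", False),
    "4": ("underline", True), "24": ("underline", False),
}


def styled_runs(text: str) -> list[tuple[str, frozenset[str]]]:
    """Split text into (substring, active styles) runs."""
    active: set[str] = set()
    runs = []
    pos = 0
    for match in SGR.finditer(text):
        if match.start() > pos:
            runs.append((text[pos:match.start()], frozenset(active)))
        pos = match.end()
        param = match.group(1)
        if param in ("", "0"):
            active.clear()  # assumption: default rendition means no styles
        elif param in TOGGLES:
            name, on = TOGGLES[param]
            if on:
                active.add(name)
            else:
                active.discard(name)
        # Anything else, e.g. "1;3" with two codes, is not interpreted.
    if pos < len(text):
        runs.append((text[pos:], frozenset(active)))
    return runs


if __name__ == "__main__":
    for run in styled_runs("plain \x1b[3mitalic\x1b[23m plain again"):
        print(run)
```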


/Kent K



On 2019-01-21 21:46, "Doug Ewell via Unicode" <unicode at unicode.org> wrote:

> Kent Karlsson wrote:
> 
>> There is already a standardised, "character level" (well, it is from
>> a character standard, though a more modern view would be that it is
>> a higher level protocol) way of specifying italics (and bold, and
>> underline, and more):
>> 
>> \u001b[3mbla bla bla\u001b[0m
>> 
>> Terminal emulators implement some such escape sequences.
> 
> And indeed, the forthcoming Unicode Technical Note we are going to be
> writing to supplement the introduction of the characters in L2/19-025,
> whether next year or later, will recommend ISO 6429 sequences like this
> to implement features like background and foreground colors, inverse
> video, and more, which are not available as plain-text characters.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 




