Encoding italic

Kent Karlsson via Unicode unicode at unicode.org
Sat Feb 9 05:51:19 CST 2019


On 2019-02-08 22:29, "Egmont Koblinger via Unicode"
<unicode at unicode.org> wrote:

> (Mind you, I don't find it a good idea to add italic and whatnot
> formatting support to Unicode at all... but let's put aside that now.)

I don't think Doug meant to "add it to the Unicode standard", just to
have a summary of "handy esc-sequences (actually command-sequences)
for simple styling of text" picked from long-standing (text level...)
standards.

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.
> 
> There is not a well-defined framework for escape sequences. In this
> particular case you might say it starts with ESC [ and ends with the
> letter 'm', but how do you know where to end the sequence if that
> letter 'm' just doesn't arrive? Terminal emulators have extremely

There is an overriding "basic (overall) syntax" for esc-seq/
command-sequences that do not include a string argument (like OSC,
APC, ...). IIUC it is (originally as byte sequences, but here as
character sequences):

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

(no newline or carriage return in there). True, that has no direct
limit, but it would not be unreasonable to set a limit of (say)
max 30 characters. Potential (i.e. starting with ESC) esc-"sequences"
that do not match the overall syntax or are too long can simply be
rendered as is (except for the ESC itself). The esc/command sequences
(that match) but are not interpreted should be ignored in "normal"
(not "show invisibles" mode) display.
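As a rough sketch of that rule in Python (the names COMMAND_SEQ, MAX_LEN and
render are made up for illustration, and the limit of 30 is just the "say"
figure from above): well-formed sequences within the limit are dropped from
normal display, anything else loses only its ESC. Python tries alternatives
in order, so the CSI form is listed first and the plain escape form is kept
from matching 'ESC [' on its own; that ordering is a choice of this sketch.

```python
import re

ESC = "\x1b"
MAX_LEN = 30  # assumed limit, taken from the "(say) max 30 characters" above

# Overall syntax: CSI (ESC [ or U+009B), parameters, intermediates, final;
# or ESC, intermediates, final. The (?!\[) keeps a broken CSI from being
# read as the two-character escape sequence "ESC [".
COMMAND_SEQ = re.compile(
    r"(?:\x1b\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    r"|\x1b(?!\[)[\x20-\x2f]*[\x30-\x7e]"
)

def render(text):
    """Return text as shown in "normal" (not "show invisibles") display."""
    def repl(match):
        seq = match.group()
        # Well-formed and short enough: ignore. Too long: show as is,
        # minus the introducer character.
        return "" if len(seq) <= MAX_LEN else seq[1:]
    # Any ESC still left did not start a well-formed sequence; drop only
    # the ESC itself. (A stray U+009B is left alone in this sketch.)
    return COMMAND_SEQ.sub(repl, text).replace(ESC, "")

print(render("a\x1b[3mitalic\x1b[23mb"))  # aitalicb
```

Note that the sketch never needs to know what a sequence means in order to
find where it ends.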

They are unlikely to be "default ignored" by such things as sorting
(and should preferably be filtered out beforehand, if possible). But
if we compare to other rich text editors, the command sequences should
be ignored by (interactive) searching, just like HTML tags are ignored
in interactive searching (the internal representation "skipping" the
HTML tags in one way or another). HTML tags should also (when the text
is known to be HTML) be filtered out before doing such things as sorting.

> complex tables for parsing (and still many of them get plenty of
> things wrong). It's unreasonable for any random small utility
> processing Unicode text to go into this business of recognizing all
> the well-known escape sequences, not even to the extent to know where
> they end. Whatever is designed should be much more easily parseable.
> Should you say "everything from ESC[ to m", you'll cause a whole bunch
> of problems when a different kind of escape sequence gets interpreted
> as Unicode.

The escape/command sequences would not be part of Unicode (standard).

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each. Also, it should be

Formally covered by the (non-Unicode) standards, but optional (IIUC).
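For what it is worth, turning a combined sequence into separate ones is a
small step once the sequence has been recognised. A minimal sketch in Python
(split_sgr is a hypothetical helper, not something from ECMA-48; treating an
empty parameter as 0 follows the SGR default):

```python
import re

# An SGR command sequence: CSI, parameter characters, final 'm'.
SGR = re.compile(r"(?:\x1b\[|\x9b)([\x30-\x3f]*)m")

def split_sgr(seq):
    """Split one combined SGR sequence into one sequence per parameter."""
    match = SGR.fullmatch(seq)
    if match is None:
        raise ValueError("not an SGR command sequence")
    # Semicolons separate parameters; colons (sub-parameters) are kept.
    params = match.group(1).split(";")
    return ["\x1b[" + (p or "0") + "m" for p in params]

print(split_sgr("\x1b[3;0;1m") == ["\x1b[3m", "\x1b[0m", "\x1b[1m"])  # True
```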

> carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[
> opening for an escape sequence - here terminal emulators vary. These
> just make everything even more cumbersome.
> 
> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".

I think one should interpret these in a "modern" way, not looking
too much at what old terminals were limited to. (Colour ("increased
intensity") should be handled completely separately from bold.)

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette? Should Unicode go into the business of defining a fixed set
> of colors, or allow to alter the palette colors using the OSC 4 and
> friends escape sequences which are supported by about half of the terminal
> emulators out there?

IF extending to colour, only refer to the "true colour" (RGB) command sequence.
The colour palette versions are for the limitations of (semi-)old terminals.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.

It can only be colon. Using semicolon would interfere with the syntax
for multiple style specifications in one command sequence. (I by mistake
wrote a semicolon there in an earlier post; sorry.)
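To illustrate the interference, here is a small Python sketch (sgr_params is
a made-up name) that splits the parameter part of an SGR sequence first on
semicolons and then on colons. The empty field after the 2 is assumed to be
the (omitted) colour space id of the colon form, IIUC.

```python
def sgr_params(body):
    """Split the text between CSI and 'm' into parameters and sub-parameters.

    Semicolons separate independent parameters; colons separate the
    sub-parameters of one parameter. Empty fields become None.
    """
    return [
        [int(field) if field else None for field in param.split(":")]
        for param in body.split(";")
    ]

# Colon form: bold, plus ONE foreground colour parameter.
print(sgr_params("1;38:2::255:128:0"))  # [[1], [38, 2, None, 255, 128, 0]]

# Semicolon form: five unrelated-looking parameters (38, 2, 255, 128, 0).
print(sgr_params("38;2;255;128;0"))     # [[38], [2], [255], [128], [0]]
```

With semicolons, a parser that does not know about 38 sees a 2 (faint) and a
trailing 0 (reset); with colons it can skip the whole parameter.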

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them? Where to draw the line what

(Note colon, not semicolon, as separator.) Possible, partially matching
the capabilities for underlining via CSS (solid, dotted, dashed, wavy,
double). Depends on how much styling options one wants to pick up.

> to add to Unicode and what not to? Will Unicode possibly be a

I don't think anyone wants to make this part of the Unicode standard.
(At the most a Unicode technical note...; from Unicode's point of view.)

[...] 
> What to do with things that Unicode might also want to have, but
> doesn't exist in terminal emulators due to their nature, such as
> switching to a different font size?

While ECMA-48 only has a palette (content defined by the implementation)
of ten fonts, xterm (!), IIUC, has 'OSC 50;<font name> BEL' (it should be an
ST not BEL, and it should be a DCS not an OSC...) for more general
font switching. Not part of Doug's proposal summary of "good to implement
command sequences". And it has a string parameter, so it cannot formally
be a command-sequence (which can only have digits and some punctuation in
them).

But a much more limited 'ESC [50m' (not variable spacing, i.e. "monospace"
font) and 'ESC [26m' (variable spacing, i.e. "proportional" font) (exactly
which fonts are implementation defined), would be reasonable. Switch to
monospace for code snippets, for quoting text from a terminal emulator, or
for "ASCII/Unicode art" (which is still quite common).

For font SIZE, ECMA-48 has: 'ESC [2 I' (select "computer decipoint", a tenth
of the "point size" unit used on computers (slightly different from
older point size units)), and then 'ESC [160 C' for 16 points. Not part of
Doug's proposal summary of "good to implement command sequences". (Note the
space before the terminating letter of the sequences!)

ECMA-48 even has a font stretch command: 'ESC [<p1>;<p2> B'. E.g. double
height would be 'ESC [200;100 B' (I don't think these accumulate, so it's
relative to the set font size). Condensed style (narrowing the characters
but keeping the height) would, e.g., be 'ESC [100;75 B' (compare the 'wdth'
design axis in OpenType). (So for the time ECMA-48 was made, it is quite
advanced on these points.) As you can see, these things are aimed at
typography/"print", not terminal emulators... And not part of Doug's
proposal.
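For concreteness, a small Python sketch of how such sequences are put
together, including the space (an intermediate character) before the final
letter. The function names are invented for illustration; only the sequence
shapes come from the text above.

```python
ESC = "\x1b"

def csi(params, intermediate, final):
    """Build CSI + parameters (joined by ';') + intermediate + final."""
    return ESC + "[" + ";".join(str(p) for p in params) + intermediate + final

def select_size_unit(unit):
    """'ESC [<unit> I', e.g. unit 2 for "computer decipoint"."""
    return csi([unit], " ", "I")

def graphic_size(size):
    """'ESC [<size> C', with size expressed in the selected unit."""
    return csi([size], " ", "C")

def font_stretch(height_percent, width_percent):
    """'ESC [<p1>;<p2> B', height and width as percentages."""
    return csi([height_percent, width_percent], " ", "B")

print(repr(font_stretch(200, 100)))  # '\x1b[200;100 B'  (double height)
print(repr(font_stretch(100, 75)))   # '\x1b[100;75 B'   (condensed)
```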

>> This mechanism [...] is already supported
>> as widely as any new Unicode-only convention will ever be.
> 
> I truly doubt this, these escape sequences are specific to terminal
> emulation, an extremely narrow subset of where Unicode is used and
> rich text might be desired.

This kind of command sequence is popular to implement in terminals
(and terminal emulators), but the text styling command sequences are not
at all (from a standards, and technical, point of view) limited to
terminals (or emulators).

> I see it a much more viable approach if Unicode goes for something
> brand new, something clean, easily parseable, and it remains the job

If done right, I don't see that the command sequences are that hard
to PARSE. (Doing it wrong will of course get you into all sorts of
trouble.) Interpreting (a selection of) them is slightly harder, but
is stuff that is very commonly implemented (bold, underline, ...) as
long as one does not get into the somewhat more advanced stuff like
condensed/extended by percentage (which is not so commonly implemented,
and not part of Doug's proposal).
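A minimal sketch, in Python, of interpreting just such a selection (bold,
italic, underline and their resets), assuming the SGR parameter values of
ECMA-48. The names styled_runs, ON and OFF are made up here, and only
CSI-type sequences are handled. Every CSI sequence is parsed out of the text,
but only the selected SGR parameters change the style; everything else is
ignored.

```python
import re

# CSI, parameters, intermediates, final character.
CSI_SEQ = re.compile(
    r"(?:\x1b\[|\x9b)([\x30-\x3f]*)([\x20-\x2f]*)([\x40-\x7e])"
)

# The selection of SGR parameters this sketch interprets.
ON = {"1": "bold", "3": "italic", "4": "underline"}
OFF = {"22": "bold", "23": "italic", "24": "underline"}

def styled_runs(text):
    """Return a list of (plain text, frozenset of active styles) runs."""
    runs, style, pos = [], set(), 0
    for match in CSI_SEQ.finditer(text):
        if match.start() > pos:
            runs.append((text[pos:match.start()], frozenset(style)))
        pos = match.end()
        params, intermediates, final = match.groups()
        if final != "m" or intermediates:
            continue  # not SGR: parsed, but not interpreted
        for param in params.split(";"):
            if param in ("", "0"):
                style.clear()            # reset all styling
            elif param in ON:
                style.add(ON[param])
            elif param in OFF:
                style.discard(OFF[param])
            # any other parameter is silently ignored
    if pos < len(text):
        runs.append((text[pos:], frozenset(style)))
    return runs

for run, styles in styled_runs("plain \x1b[3mitalic\x1b[23m and \x1b[1;4mboth\x1b[0m."):
    print(repr(run), sorted(styles))
```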

> of specific applications to serve as a bridge between the two worlds.
> Or, if it wants to adopt some already existing technology, I find
> HTML/CSS a much better starting point.

(X)HTML/CSS is fine. But it 1) requires a "second" level of parsing
(actually several different parsers), and 2) is a huge task to implement.
Command sequences (à la ECMA-48) are 1) possible to parse out at the
text level, and 2) interpretation can be limited to "simple" styling
(like in Doug's proposal, perhaps extended some (or a lot, depending)...),
and then is a much smaller implementation task than HTML/CSS.

/Kent K




