Encoding italic (was: A last missing link)

Kent Karlsson via Unicode unicode at unicode.org
Thu Jan 24 17:46:35 CST 2019


Den 2019-01-24 03:21, skrev "Mark E. Shoulson via Unicode"
<unicode at unicode.org>:

> On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote:
>> Ok. One thing to note is that escape sequences (including control sequences,
>> for those who care to distinguish those) probably should be "default
>> ignorable" for display. Requiring, or even recommending, them to be default
>> ignorable for other processing (like sorting, searching, and other things)
>> may be a tall order. So, for display, (maximal) substrings that match:
>> 
>> \u001B[\u0020-\002F]*[\u0030-\007E]|
>> (\u001B'['|\009B)[\u0030-\003F]*[\u0020-\002F]*[\u0040-\007E]
>> 
>> should be default ignorable (i.e. invisible, but a "show invisibles" mode
>> would show them; not interpreted ones should be kept, even if interpreted
>> ones need not, just (re)generated on save). That is as far as Unicode
>> should go.
> 
> So it isn't just "these characters should be default ignorable", but
> "this regular expression is default ignorable."  This gets back to
> "things that span more than a character" again, only this time the
> "span" isn't the text being styled, it's the annotation to style it. 

True. That is how ECMA/ISO/ANSI escape/control-sequences are designed.
Had they not already been designed, and implemented, but we were to do
a design today, it would surely be done differently; e.g. having
"controls" that consisted only of (individually) "default-ignorable"
characters.

But, and this is the important thing here:

a) The current esc/control-sequences is an accepted standard,
since long.

b) This standard is still in very much active use, albeit mostly
by terminal emulators. But the styling stuff need not at all
be limited to terminal emulators.

Since it is an actively and widely used standard, I don't see the
point of trying to design another way of specifying "default
ignorable"-controls for text styling. (HTML, for instance, does not
have "default ignorable" controls, since ALL characters in the
"controls" are printable characters, so one needs a "second level"
for parsing the controls.) True, ignoring or interpreting an
esc/control-sequence requires some processing of substrings, since
some (all but the first) are printable characters. But not that hard.
It has been implemented over and over...

Had this standard been defunct, then there would be an opportunity
to design something different.

<OT>
> The "bash" shell has special escape-sequences (\[ and \]) to use in
> defining its prompt that tell the system that the text enclosed by them
> is not rendered and should not be counted when it comes to doing

Never heard of. Cannot find any reference mentioning them. Reference?
</OT>

> cursor-control and line-editing stuff (so you put them around, yep, the
> escape sequences for coloring or boldfacing or whatever that you want in
> your prompt). 

<OT>
Line editing stuff in bash is done on an internal buffer (there is a library
for doing this, and that library can be used by various other command line
programs; bash does not use the system input line editing). Then that
library tries to show what is in the buffer on the terminal. So, I'm
not sure what you are talking about; bash does NOT (somehow) scrape
the screen (terminal emulator window).
</OT>

Furthermore, colouring and bold/underline is quite common not only in
prompts, but also in output directed at a terminal from various programs.
(And it works just fine.) Unfortunately cut-and-paste tends to loose
much (or all) of that. (Would be nicer if it got converted to HTML,
RTF, .doc, or whatever is the target format; or just nicely kept if
"plain text" is the target.)

 
> That would seem to be at least simpler than a big ol'
> regexp, but really not that much of an improvement.  It also goes to
> show how things like this require all kinds of special handling,
> even/especially in a "simple" shell prompt (which could make a strong
> case for being "plain text", though, yes, terminal escape codes are a
> thing.)

They are NOT "terminal escape codes". It is just that, for now, it is
just about only terminal emulator that implement esc/control-sequences.
>From https://www.ecma-international.org/publications/standards/Ecma-048.htm:
"The control functions are intended to be used embedded in character-coded
data for interchange, in particular with character-imaging devices."
A (plain) text editor is an example of a 'character-imaging device'.
(Yes, the terminology is a bit dated.)

/Kent K

> 
> ~mark





More information about the Unicode mailing list