Re: “plain text styling”…

Kent Karlsson kent.b.karlsson at bahnhof.se
Mon Jan 9 08:46:29 CST 2023



> On 9 Jan 2023, at 09:22, Marius Spix <marius.spix at web.de> wrote:
> 
> We should also be aware that plain text styling has many potential security risks. For example, if we had the characters <START_ITALICS> and <RESET_STYLE>, someone could create two different strings which look exactly the same, like "Hallo World" and "H<START_ITALICS><RESET_STYLE>llo World". This may allow identity spoofing, bypassing regex filters, phishing or even hash collision attacks.

((Not sure where the ”a” went…))
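
To make the quoted concern concrete, here is a minimal sketch in Python. It assumes that the hypothetical <START_ITALICS> and <RESET_STYLE> characters are stood in for by the ECMA-48 SGR sequences ESC[3m and ESC[0m; the pattern SGR_RE, the helper names and the sample blocklist are illustrative only, not part of any proposal.

```python
import hashlib
import re

ESC = "\x1b"
ITALIC_ON = ESC + "[3m"  # SGR 3: italicized (stand-in for <START_ITALICS>)
RESET = ESC + "[0m"      # SGR 0: default rendition (stand-in for <RESET_STYLE>)

plain = "Hallo World"
# An empty italic span: a styling-aware display would show the same text.
styled = "H" + ITALIC_ON + RESET + "allo World"

# Hypothetical, minimal pattern: matches SGR control sequences only.
SGR_RE = re.compile(r"\x1b\[[0-9;:]*m")


def strip_sgr(text):
    """Remove SGR control sequences, leaving the printable text."""
    return SGR_RE.sub("", text)


def digest(text):
    """SHA-256 of the UTF-8 encoding, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


blocklist = re.compile(r"Hallo")  # a naive filter applied to the raw string

strings_differ = plain != styled
digests_differ = digest(plain) != digest(styled)
filter_bypassed = blocklist.search(styled) is None
same_after_strip = strip_sgr(styled) == plain
```

The two strings compare unequal and hash differently, and the naive filter misses the styled one, while stripping the styling controls before the check makes them identical again.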

This is true for any kind of ”edit” (automatic, semi-automatic, manual) that may be done between such a security check and any kind of ”execution”, including but not limited to:
* Unicode normalisation
* Replacing malformed UTF-8 or UTF-16 with some kind of replacement (like ?, SUB, or REPLACEMENT CHARACTER), or deleting it
* Case mapping
* Encoding mapping (like mapping to ASCII), which may ”lose”/replace non-convertible characters
* ”Drop accents” mapping
* Adding or removing HTML tags
* Expanding (or inserting) any kind of character or string references (like \uNNNN, &#xNNNN;, &lt;, \xNN, …)
* Replacing quote marks by other quote marks
* Removing/replacing any ”undesirable” control character or default ignorable character
* Spell corrections
* Correcting syntax errors (if it is some kind of command or query)
* Doing bidi reordering to get a ”visual order” string; or using bidi controls to confuse the character order; or any kind of inverse bidi reordering
* and lots more

So there is nothing new or unique about using ECMA-48 text styling in this regard.
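
A few of the edits listed above can be sketched in Python to show how a string that passes a check can change afterwards. This is only an illustration: looks_safe is a hypothetical check, and the sample strings were picked for the example.

```python
import unicodedata


def looks_safe(text):
    """Hypothetical check: reject ASCII angle brackets and the name 'admin'."""
    return "<" not in text and ">" not in text and text != "admin"


# FULLWIDTH LESS-THAN SIGN, "b", FULLWIDTH GREATER-THAN SIGN
fullwidth = "\uff1cb\uff1e"
# KELVIN SIGN, which case-maps to an ordinary "k"
kelvin = "\u212a"
# "admin" with a ZERO WIDTH SPACE (a default ignorable character) inside
with_zwsp = "ad\u200bmin"

# Unicode normalisation (NFKC folds the fullwidth forms to ASCII)
after_nfkc = unicodedata.normalize("NFKC", fullwidth)
# Case mapping
after_lower = kelvin.lower()
# Encoding mapping to ASCII that silently drops non-convertible characters
after_ascii = with_zwsp.encode("ascii", "ignore").decode("ascii")
```

Here both fullwidth and with_zwsp pass looks_safe, but after_nfkc is "<b>" and after_ascii is "admin", so the same check applied after the edit would reject them.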

> Styling is supposed to be done at application layer.

Yes… (Not sure what you are aiming at here.)

Kind regards
/Kent K
> Sent: Sunday, 08 January 2023 at 23:08
> From: "Kent Karlsson via Unicode" <unicode at corp.unicode.org>
> To: "Sławomir Osipiuk" <sosipiuk at gmail.com>
> Cc: unicode at corp.unicode.org
> Subject: Re: “plain text styling”…
>  
>  
> On 8 Jan 2023, at 18:34, Sławomir Osipiuk via Unicode <unicode at corp.unicode.org> wrote:
>  
> On Sunday, 08 January 2023, 09:15:21 (-05:00), Kent Karlsson via Unicode wrote:
> 
> The point is that the ”protocol” is at plain text level. That is why ECMA-48 styling can work for applications like terminal emulators, where higher-level protocols, like HTML, are out of the question.
> 
> This does not make sense. Both are formats that need to be interpreted by the display software or they just look like junk within the visible text.
>  
> Yes…
>  
> HTML and ECMA-48 are no different in principle.
>  
> On this point they are wildly different. One is possible to use in contexts such as terminal emulators, indeed intended for such use. The other one cannot be used in such contexts.
> And the precise reason is that one is a plain text protocol, and the other a higher level protocol. One cannot make an HTML(-like) based terminal emulator, since the controls in HTML are purely printable characters (which in turn requires that certain characters *must* be represented via character escapes, like &lt;, or otherwise risk being part of a control).
>  
> Now, using ECMA-48 styling controls for styling text (that may be stored in a file) is not vitally dependent on that. It is, for that use, just a question of reuse of an already existing mechanism for specifying styling. That mechanism need not be locked in to be used only for terminal emulators. (Though some of the proposed additions may be useful also for terminal emulators, and indeed some already are; I ”grabbed” some suggestions from additions already implemented in some terminal emulators, with the intent of not compromising those implementations.)
>  
> You can write a terminal emulator that respects basic HTML styling.
>  
> Nope. Violates the plain text principle of terminal emulators. (Besides, HTML has a nesting structure, but that is a different obstacle for your suggestion here.)
>  
> The only reason it hasn't been done is because there is no demand, and that is because of historical reasons (including that many terminal scripting languages have syntax that would conflict with HTML).
>  
> From the perspective of a simple (basic) text editor that knows nothing about styling, what is the difference between displaying these two examples, which aim at the same intended result?
> <b>bold</b>
> versus
> \x1b[1mbold\x1b[22m
> 
> The first one is a higher level protocol (interpreting substrings consisting purely of ”printable characters” as controls; counting SP, HT and LF as ”printable”), the second is a text level protocol.
> 
> No. Whether "<" or \x1b is a special syntax introducer makes no real difference.
>  
> Except that it does. See above.
>  
> You need something to recognize it and interpret it.
>  
> Yes, but that is not ”it”.
>  
> Both standards are about interpreting substrings, with opening and closing characters and formatting information between them. There is nothing inherently special about having the characters be below \x20, certainly not any more than, for example, using the tag characters.
>  
> There very much is a difference between control characters and printable characters (including SP, LF, CR, HT), in that the latter are ”normal text” to be printed, while control characters are, well, control characters, not to be printed. True, the distinction is somewhat ”muddled” in that SP/CR/LF/HT/VT/NEL aren’t all that ”pure control”, while characters like SHY actually are control characters but are not formally counted as such. Plus there are the various control characters introduced by Unicode (like bidi controls; note that HTML has its separate way of doing bidi controls, using printable characters, not the Unicode bidi controls). So I agree that it is not straightforward, but there really is a difference.
>  
> Kind regards
> /Kent K
>  
>  
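
The control versus printable distinction discussed in the quoted exchange can be inspected with Unicode general categories. The following is a minimal Python sketch; the helper name categories is illustrative, and SGR 22 (normal intensity) is assumed as the sequence that ends bold.

```python
import unicodedata

html_example = "<b>bold</b>"
ecma48_example = "\x1b[1mbold\x1b[22m"  # SGR 1 = bold, SGR 22 = normal intensity


def categories(text):
    """Return the set of Unicode general categories used in the text."""
    return {unicodedata.category(ch) for ch in text}


html_is_printable = html_example.isprintable()
ecma48_is_printable = ecma48_example.isprintable()

html_has_control = "Cc" in categories(html_example)
ecma48_has_control = "Cc" in categories(ecma48_example)

# SOFT HYPHEN and the bidi controls are format characters (Cf), not Cc.
shy_category = unicodedata.category("\u00ad")
rlo_category = unicodedata.category("\u202e")
# LINE FEED is formally a control character (Cc), although it is "normal text".
lf_category = unicodedata.category("\n")
```

The HTML example consists only of printable characters, while the ECMA-48 example contains ESC (category Cc), which a program can recognise as a control without knowing anything about styling.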
