Italics get used to express important semantic meaning, so unicode should support them

Kent Karlsson kent.b.karlsson at bahnhof.se
Tue Dec 15 17:07:05 CST 2020


(Below)

> On 14 Dec 2020, at 18:02, Sławomir Osipiuk via Unicode <unicode at unicode.org> wrote:


> If you or someone else chooses to make a proposal, my own recommendation would be this:
> 
> - Assign a new character U+E0002 FORMAT TAG
> - The syntax follows the specification for tagging (chapter 23.9)
> - U+E0002 can be followed by any combination of U+E0062 (bold), U+E0065 (emphatic), U+E0069 (italic), and U+E0079 (underlined) to indicate a span of text with that formatting.
> - U+E0002 U+E007F CANCEL TAG to cancel all formatting
> - Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
> - This method should only be used in cases where formatting is required without a higher-level protocol
> - This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
> - Strikethrough and super/subscript are deliberately omitted for the above reason.

Now, where did I see something very much like this??? 

…

…

Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very close (especially the ”invisible by default” (”default ignorable”) part, IF parsed correctly). And… ECMA-48 is already a standard. And… ECMA-48 is already successful, and still used every day by a great many people, though primarily in terminal emulators. (Nit: ECMA-48 does have strikethrough… and more. So does HTML/CSS, and when doing ”copy as plain text”, that formatting disappears too.)

Your U+E0002 FORMAT TAG: ECMA-48 CSI … m
Your U+E0062 (bold): ECMA-48 CSI 1m
Your U+E0065 (emphatic): don’t know what you mean by that
Your U+E0069 (italic): ECMA-48 CSI 3m
Your U+E0079 (underlined): ECMA-48 CSI 4m
Your U+E007F CANCEL TAG: ECMA-48 CSI 0m
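
The correspondences listed above can be tried in any terminal emulator that implements ECMA-48. A minimal sketch in Python, assuming the 7-bit form of CSI (ESC followed by "["); the names SGR and styled are made up for this illustration, and whether italic actually renders depends on the terminal:

```python
CSI = "\x1b["  # 7-bit form of the ECMA-48 CSI: ESC followed by "["

# SGR (Select Graphic Rendition) parameters taken from the list above.
SGR = {"bold": 1, "italic": 3, "underlined": 4}

def styled(text, *styles):
    """Wrap text in CSI <params> m, then reset everything with CSI 0m."""
    params = ";".join(str(SGR[s]) for s in styles)
    return f"{CSI}{params}m{text}{CSI}0m"

print(styled("important", "bold", "italic"))
```

Several attributes go into one control sequence, separated by semicolons, which is close to the "combine all desired formats into a single tag" rule in the proposal.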

It is not entirely inconceivable to map all the (otherwise) printable characters used by such control sequences to TAG characters, thus making the ”default ignorable” part of this a bit easier.
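
To make that idea concrete: the TAG characters U+E0020..U+E007E mirror ASCII 0x20..0x7E, so the printable part of a control sequence can be shifted into the TAG range by adding 0xE0000. The sketch below is only an illustration of such a mapping, not an existing encoding; the function name to_tag_chars is hypothetical, and how the sequence would be introduced is left open.

```python
TAG_BASE = 0xE0000  # U+E0020..U+E007E mirror ASCII 0x20..0x7E

def to_tag_chars(s):
    """Map each printable ASCII character to its TAG counterpart."""
    out = []
    for ch in s:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"no TAG counterpart for {ch!r}")
        out.append(chr(TAG_BASE + ord(ch)))
    return "".join(out)

# Hypothetical use: the "1;3m" part of CSI 1;3m carried by TAG characters.
tagged = to_tag_chars("1;3m")
print([f"U+{ord(c):05X}" for c in tagged])
```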

Extra nit: Some markdowns (however did that name stick?) allow for strikethrough as well, as -stricken-. Though somewhat intuitive, it far too often has an unexpected effect where no strikethrough was intended (try running ’ls -l’ in your Linux terminal and pasting the result somewhere that has that kind of markdown).
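
A small sketch of how that goes wrong, assuming a deliberately naive rule where -text- means strikethrough. The regular expression and the sample line are made up for illustration; real markdown implementations differ in their details.

```python
import re

# Hypothetical, deliberately naive rule: -text- means strikethrough.
NAIVE_STRIKE = re.compile(r"-([^\s-]+)-")

# Made-up sample line in the style of `ls -l` output.
line = "-rw-r--r-- 1 user group 0 Dec 15 17:07 notes.txt"

# Fragments of the permission string that the rule would strike by accident.
print(NAIVE_STRIKE.findall(line))
```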

”Math Italic” is a hack for MathML. Had it been done right, MathML would not have needed those characters either. Using ”Math Italic” for emphasis in running text (not MathML) only ”works” (sort of, and partially) for English, and for almost no other language. Please don’t use the ”Math italic/bold/etc.” characters outside of MathML.
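
To illustrate the language problem: the Mathematical Italic letters cover only A–Z and a–z (with italic small h living at U+210E PLANCK CONSTANT), so precomposed letters such as å or ö have no counterpart. A rough sketch; the function name math_italic is made up here, and it simply leaves unmapped characters upright:

```python
def math_italic(text):
    """Replace ASCII letters with their Mathematical Italic counterparts."""
    out = []
    for ch in text:
        if ch == "h":
            out.append("\u210E")  # italic small h is PLANCK CONSTANT
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        else:
            out.append(ch)  # no counterpart: the character stays upright
    return "".join(out)

# Swedish "på": the p is converted, the å is left as it was.
print([f"U+{ord(c):04X}" for c in math_italic("på")])
```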

/Kent Karlsson

PS
The first edition of ECMA-48 came out in 1976, about 44 years ago.


> Advantages:
> - Only a single new character needs definition.
> - Uses an existing framework (tags)
> - Formatting is ignorable, implementation is optional
> - A viable method to preserve 95%+ of typical semantic formatting in plain-text
> - IMO a stronger case to have this than either language tags or annotations (argument is to accurately preserve the lot of existing documents that include rudimentary formatting, rather than just invent new features).
> 
> Disadvantages:
> https://xkcd.com/927/
> 
> Sławomir Osipiuk
> 
> 
