Italics get used to express important semantic meaning, so unicode should support them

Sławomir Osipiuk sosipiuk at gmail.com
Mon Dec 14 11:02:26 CST 2020


On Sun, Dec 13, 2020 at 12:47 AM Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
> Write a killer social media app that uses these in an integral fashion and requires them for interoperability and then sit back and watch how long they stay deprecated ...

That, or perhaps something like Wikidata could use it. ;)

I slept on it, and I'm leaning to the other side now. I think of the paper books I've read, and italics often appear within the text. Are the books "plain text"? Do the italics really fall into the category of typesetting and style, like the choice of overall font? Or are they a meaningful part of the text itself? Should it be possible to fit the content of a whole novel into a .txt file without losing any semantic meaning? The "spirit of Unicode" whispers that it should. Of course some books contain charts and graphics, and Unicode can't do everything, but if a solution can cover 95% of cases, it at least deserves consideration.

On Fri, Dec 11, 2020 at 1:13 PM Christian Kleineidam via Unicode <unicode at unicode.org> wrote:
>
> Create a new unicode character for begin/end italic formatting and begin/end bold formatting that works like the unicode character for the Right-to-Left switch.

If you or someone else chooses to make a proposal, my own recommendation would be this:

- Assign a new character U+E0002 FORMAT TAG
- The syntax follows the specification for tag characters (section 23.9 of the Unicode Standard)
- U+E0002 can be followed by any combination of U+E0062 (bold), U+E0065 (emphatic), U+E0069 (italic), and U+E0075 (underlined) to indicate a span of text with that formatting.
- U+E0002 followed by U+E007F CANCEL TAG cancels all formatting
- Any use of U+E0002 overrides previous formatting (i.e. a "bold" tag alone cancels a previous "italic" tag), so format nesting must be done by combining all desired formats into a single tag.
- This method should only be used in cases where formatting is required without a higher-level protocol
- This method should not be used in instances where loss of formatting would greatly alter the meaning of the text or render it incomprehensible.
- Strikethrough and super/subscript are deliberately omitted for the above reason.
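
A rough sketch of how the rules above might look in Python. This is only an illustration: U+E0002 is currently unassigned, the function names are made up, and I am assuming the tag letters are b, e, i, u (so underline is U+E0075, the tag character for "u").

```python
# Hypothetical sketch of the proposed FORMAT TAG scheme; not an existing API.
FORMAT_TAG = "\U000E0002"  # proposed character, currently unassigned
CANCEL_TAG = "\U000E007F"  # existing CANCEL TAG
TAG_BASE = 0xE0000         # a tag character is TAG_BASE + the ASCII code

# Assumed mnemonic letters for the four formats.
FORMATS = {"b": "bold", "e": "emphatic", "i": "italic", "u": "underlined"}


def format_tag(letters):
    """Return FORMAT TAG plus tag letters; an empty string cancels all formatting."""
    if not letters:
        return FORMAT_TAG + CANCEL_TAG
    for ch in letters:
        if ch not in FORMATS:
            raise ValueError("unknown format letter: " + repr(ch))
    return FORMAT_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in letters)


def parse_spans(text):
    """Split text into (plain_text, active_formats) pairs."""
    spans = []
    active = frozenset()
    buf = []
    i = 0
    while i < len(text):
        if text[i] == FORMAT_TAG:
            if buf:
                spans.append(("".join(buf), active))
                buf = []
            i += 1
            new = set()
            # Consume the run of tag characters that follows FORMAT TAG.
            while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007F:
                if text[i] != CANCEL_TAG:
                    name = FORMATS.get(chr(ord(text[i]) - TAG_BASE))
                    if name:
                        new.add(name)
                i += 1
            # Each tag replaces the previous formatting; it never nests.
            active = frozenset(new)
        else:
            buf.append(text[i])
            i += 1
    if buf:
        spans.append(("".join(buf), active))
    return spans


sample = "a " + format_tag("i") + "very" + format_tag("") + " good book"
for chunk, formats in parse_spans(sample):
    print(repr(chunk), sorted(formats))
```

Because every tag overrides the one before it, the parser keeps no stack: bold-italic text followed by bold-only text is just two tags, "bi" and then "b".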

Advantages:
- Only a single new character needs definition.
- Uses an existing framework (tags)
- Formatting is ignorable, implementation is optional
- A viable method to preserve 95%+ of typical semantic formatting in plain text
- IMO there is a stronger case for this than for either language tags or annotations (the argument is to accurately preserve the large body of existing documents that include rudimentary formatting, rather than just to invent new features).
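
To illustrate the "ignorable" point: a process that does not care about formatting could recover the bare text with a single substitution. A minimal sketch, again assuming the hypothetical U+E0002 FORMAT TAG; the function name is made up.

```python
import re

# Hypothetical: FORMAT TAG (U+E0002, unassigned today) followed by its run of
# tag characters. Other uses of tag characters are deliberately left alone.
FORMAT_RUN = re.compile("\U000E0002[\U000E0020-\U000E007F]*")


def strip_format_tags(text):
    """Drop every format tag sequence, leaving the plain text untouched."""
    return FORMAT_RUN.sub("", text)


tagged = "a \U000E0002\U000E0069very\U000E0002\U000E007F good book"
print(strip_format_tags(tagged))
```

Only sequences introduced by the format tag are removed, so emoji tag sequences, which have no U+E0002 in front of them, pass through unchanged.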

Disadvantages:
https://xkcd.com/927/

Sławomir Osipiuk



