Italics get used to express important semantic meaning, so unicode should support them

Doug Ewell doug at ewellic.org
Wed Dec 23 16:40:43 CST 2020


Replying to a bunch of messages at once; the impending holidays and such
have limited my available time for extended posts. Some of these topics
may be “resolved” by now, so enjoy the nostalgia.
 
Sławomir Osipiuk wrote:
 
>> All TAG symbols placed between a U+E003C TAG LESS-THAN SIGN and a
>> U+E003E TAG GREATER-THAN SIGN, inclusive, are to be treated as if
>> they were their corresponding ASCII characters, and run that through
>> an HTML renderer. I guess if you wanted you could stipulate some
>> reduced or restricted subset of HTML
>
> I've been informed off-list that BabelPad uses this as a formatting
> option. So, it's been done.
 
I do use this feature in BabelPad at times -- in fact, just today while
copying the Unicode subsection on “plain text” from Section 2.2 of
the PDF, and not feeling inclined at the moment to open Word. But it’s
a bit like using PUA characters, or even SCSU: I know this usage is not
part of the standard and unlikely to be supported by anything else, so
absent an explicit agreement, I’d better keep it to myself.
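
For anyone who wants to see what that tag-character mapping amounts
to, here is a minimal Python sketch. It assumes only that each tag
character sits at its ASCII counterpart's code point plus 0xE0000; the
function names to_tags and from_tags are made up for illustration, and
this is not a description of how BabelPad implements the feature.

```python
TAG_BASE = 0xE0000  # tag characters mirror ASCII at this offset


def to_tags(markup: str) -> str:
    """Map printable ASCII (U+0020..U+007E) to the matching tag characters."""
    out = []
    for ch in markup:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"no tag character for U+{cp:04X}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tags(text: str) -> str:
    """Map tag characters back to ASCII; leave every other character alone."""
    return "".join(
        chr(ord(ch) - TAG_BASE) if 0xE0020 <= ord(ch) <= 0xE007E else ch
        for ch in text
    )


# Ordinary text with the markup hidden in tag characters.
styled = to_tags("<i>") + "really" + to_tags("</i>")
print(from_tags(styled))
```

A renderer that knows nothing of the convention would normally show
just "really", since tag characters are default ignorable; one that
does could hand the from_tags() result to an HTML engine.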
 
> My guiding example is, "record fully the story text of a paperback
> novel".
 
So here is the salient part I gathered from the TUS definition, with
BabelPad formatting (hee hee) removed. Apologies if this passage is too
lengthy to qualify as fair use:
 
<quote>
Plain text represents character content only, not its appearance. It can
be displayed in a variety of ways and requires a rendering process to
make it visible with a particular appearance. If the same plain text
sequence is given to disparate rendering processes, there is no
expectation that rendered text in each instance should have the same
appearance. Instead, the disparate rendering processes are simply
required to make the text legible according to the intended reading.
This legibility criterion constrains the range of possible appearances.
The relationship between appearance and content of plain text may be
summarized as follows:
 
Plain text must contain enough information to permit the text to be
rendered legibly, and nothing more.
</quote>
 
The emphasis on “legibility” seems important here. Despite the focus
on “semantic meaning” in this thread, neither of those words appears
anywhere in the TUS definition of plain text.
 
Kent Karlsson wrote:
 
> Now, where did I see something very much like [Sławomir’s original
> suggestion with U+E0002 FORMAT TAG]??? 
>
> Oh yes, ECMA-48. Not exactly the same, but quite close. Indeed very
> close (especially the ”invisible by default” (”default ignorable”) IF
> parsed correctly). And… ECMA-48 is already a standard.
 
Perhaps surprisingly, or perhaps not, ECMA-48 is actually my favorite
mechanism for low-level styling of plain text, mostly for the reasons
Kent cites here and elsewhere: it’s lightweight, it’s been a
standard for a long time, and it’s already in extensive use by at
least one sector of text processing.
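
To show how lightweight this is, here is a small Python sketch of
italics done with ECMA-48 SGR (SELECT GRAPHIC RENDITION) sequences,
where parameter 3 turns italics on and 23 turns them off. The helper
names sgr and italic are hypothetical, and whether the result actually
displays as italic depends on the terminal or other consumer.

```python
CSI = "\x1b["  # ESC [ : the 7-bit form of CONTROL SEQUENCE INTRODUCER


def sgr(*params: int) -> str:
    """Build an SGR control sequence: CSI, parameters joined by ';', then 'm'."""
    return CSI + ";".join(str(p) for p in params) + "m"


def italic(text: str) -> str:
    """Wrap text in 'italicized' (SGR 3) and 'not italicized' (SGR 23)."""
    return sgr(3) + text + sgr(23)


line = "I " + italic("really") + " mean it."
print(line)
```

The styling costs a handful of extra characters per span, and the text
between the sequences is left untouched.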
 
Kent might be one of the surprised ones, because I haven’t been a fan
of some of the “updates” to ECMA-48 that he has recommended, in
particular those that I feel extend, restrict, or invent too much. But I
like the standard in general, and some modest amount of updating is
probably inevitable to keep it current.
 
Sławomir:
 
> I didn't mention it because it's a bit outdated
 
“Outdated” is just generally a big red flag for me. If a standard
doesn’t meet modern needs, and can’t reasonably be made to do so,
that’s one thing, but the fact that it was developed some arbitrary
number of years ago is not something I care about. Unicode itself is
about 30 years old and I hope nobody sees that as evidence it needs
imminent replacing.
 
> But that "if parsed correctly" is quite the nit, isn't it?
 
This is true for any such mechanism. I remember early HTML authors being
upset when browsers stopped accepting <b>text <i>like</b> this</i>. Some
of the emoji mechanisms involving combinations of ZWJ, variation
selectors, Fitzpatrick swatches, and toupees might boggle some
implementers’ minds, but to play the game, you’ve got to learn the
rules.
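
As a rough illustration of how many moving parts one of those emoji
sequences has, this Python sketch spells out the code points of a
single example: a woman, a skin-tone modifier, a joiner (U+200D ZERO
WIDTH JOINER), and a red-hair component. The sequence is just one
picked for the example, describe is a made-up helper, and the names
printed depend on the Unicode version behind the Python build.

```python
import unicodedata

# Base emoji, Fitzpatrick modifier, joiner, hair component.
sequence = "\U0001F469\U0001F3FD\u200D\U0001F9B0"


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, character name) pairs for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for code_point, name in describe(sequence):
    print(code_point, name)
```

Four code points, one glyph on screen if everything downstream plays
by the rules, and four separate pictures or boxes if it does not.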
 
David Starner wrote:
 
> ECMA-48 is not plain text.
 
Exactly so, but it’s a VERY thin layer above plain text, which is part
of what I like about it.
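
One way to see how thin the layer is: getting back to the plain text
takes a single regular expression. The Python sketch below assumes the
7-bit CSI form (ESC followed by "[") and the general control-sequence
shape of parameter bytes, intermediate bytes, and one final byte; the
name strip_csi is hypothetical, and other ECMA-48 control functions
are not handled.

```python
import re

# ESC [ , then parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F,
# and a single final byte 0x40-0x7E.
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text: str) -> str:
    """Remove CSI control sequences, leaving the underlying plain text."""
    return CSI_SEQUENCE.sub("", text)


styled = "I \x1b[3mreally\x1b[23m mean it."
print(strip_csi(styled))
```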
 
--
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org
 


