A last missing link for interoperable representation

David Starner via Unicode unicode at unicode.org
Wed Jan 9 03:25:54 CST 2019


On Tue, Jan 8, 2019 at 11:58 PM James Kass via Unicode <unicode at unicode.org>
wrote:

>
> David Starner wrote,
>
>  > Can some books be mostly handled with Unicode plain text
>  > and italics? Sure. HTML can handle them quite nicely. ...
>
> Yes, many books can be handled very well with HTML using simple
> mark-up.  If I were producing a computer file to reproduce an old
> fiction novel, that's how I'd do it.  Not because it's better or simpler
> than plain text, but because it can't really be done in plain text at
> this time.  But if a section of the text is copy/pasted from the screen
> into an editor, some of the original information may be lost.
>

Looking at the Encyclopedia Brown book at hand, you'd lose any marking that
"The Case of the Headless Ghost" is the chapter header. The picture
of the treasure chest may be gratuitous, but "he hung his sign outside the
garage:" is followed by an image of said sign that says "BROWN DETECTIVE
AGENCY...". If you copy/paste that without carrying the original image
along, some of the original information will be lost.

In the Gmail editor, I see buttons to make the text bold, italic, or
underlined, and to change the color, text size and font. English users tend
to see italics as part and parcel of the text formatting. One can argue
that's part of history, that italics is somehow different from bold and
underline and font and text size changes, but when the standard perception
conveniently matches how Unicode encodes the script, there doesn't seem
much point in changing things, especially with terabytes of text that
encodes italics separately from the plain text matter.

Frequently, copy/pasting material does preserve non-plain text features; if
I paste a title from Wikipedia into here, it will show up much larger than
the rest of the text. It's a pain, because I want the underlying text, not
how it was displayed in the context.

Honestly, I could argue that case should not be encoded. It would simplify
so much processing of Latin script text, and most of the time
case-sensitive operations are just wrong. Case is clearly a headache that
has to be dealt with in plain text, but it certainly doesn't encourage me
to add another set of characters that are basically the same but not.
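
To make the case headache concrete, here is a minimal Python sketch of a
caseless comparison. The helper name caseless_equal is my own invention for
illustration, and the normalization step is a simplification of what the
Unicode Standard calls canonical caseless matching, not a full
implementation of it.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case (hypothetical helper)."""

    def fold(s: str) -> str:
        # casefold() applies Unicode full case folding, which is more
        # aggressive than lower(): for example it maps "ß" to "ss".
        # NFC normalization afterwards is a simplification so that
        # composed and decomposed forms of the folded text compare equal.
        return unicodedata.normalize("NFC", s.casefold())

    return fold(a) == fold(b)


# Case folding treats these as the same word.
print(caseless_equal("Straße", "STRASSE"))

# Naive lowercasing leaves "ß" alone, so the same comparison fails.
print("Straße".lower() == "STRASSE".lower())
```

Even this small example needs a decision about which folding and which
normalization to apply, which is the kind of processing cost that a parallel
set of italic characters would add to.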