Encoding italic

Philippe Verdy via Unicode unicode at unicode.org
Sun Jan 27 17:44:18 CST 2019

You're not very explicit about the Tag encoding you use for these styles.

Of course it must not be a language tag, so the introducer is not U+E0001,
nor a cancel-all tag, so it is not prefixed by U+E007F.
It also cannot use letter-like, digit-like or hyphen-like tag characters
for its introduction.
So you probably use some prefix in U+E0002..U+E001F and some additional tag
(tag "I" for italic, tag "B" for bold, tag "U" for underline, tag "S" for
strikethrough?) and the cancel tag to return to normal text (terminating the
tagged sequence).

Or maybe you just use standard HTML encoding by adding U+E0000 to each
character of the HTML tag syntax (including attributes and close tags,
allowing embedding?) So you use the "<" and ">" tag characters (possibly
also the space tag U+E0020, or TAB tag U+E0009 for separating attributes
and the quotation tags for attribute values)?
Is your proposal also allowing the embedding of other HTML objects (such as
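
To make this second guess concrete, here is a minimal Python sketch of what
"adding U+E0000 to each character" would mean. It is only an illustration of
my guess, not the encoding BabelPad actually uses; the names to_tag_chars and
from_tag_chars are hypothetical. The sketch is restricted to printable ASCII
(U+0020..U+007E), on the assumption that only U+E0020..U+E007E are used as
tag characters.

```python
TAG_BASE = 0xE0000  # start of the Unicode Tags block


def to_tag_chars(markup: str) -> str:
    """Map each printable ASCII character to the tag character at the same offset."""
    out = []
    for ch in markup:
        cp = ord(ch)
        if not 0x20 <= cp <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(TAG_BASE + cp))
    return "".join(out)


def from_tag_chars(text: str) -> str:
    """Inverse mapping: tag characters become ASCII, everything else is kept."""
    return "".join(
        chr(ord(ch) - TAG_BASE)
        if TAG_BASE + 0x20 <= ord(ch) <= TAG_BASE + 0x7E
        else ch
        for ch in text
    )


# Hypothetical remapped markup: only the syntax is moved, the text stays as-is.
hidden = to_tag_chars("<i>") + "italic" + to_tag_chars("</i>")
```

Decoding the string with from_tag_chars(hidden) gives back the ordinary HTML
syntax around the unchanged text, which is why the questions about attribute
values and element names matter.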

In that case all you do is remap the HTML syntax outside the
standard text. If an attribute value contains standard text (such as <span
title="Some text">...</span>) do you also remap the attribute value, i.e.
"Some text"? Do you remap the technical name of the HTML tag itself i.e.
"span" in the last example?

And what is the benefit then compared to standard HTML (it is not more
compact, and just adds another layer on top of it), except allowing one to
embed it in places where plain HTML would be restricted by form inputs or
would be reconverted using character entities hiding the effect of "<", ">"
and "&" in HTML so they are not reinterpreted as HTML but as plain-text
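
As a small illustration of that point, the Python sketch below uses the
standard html.escape function as a stand-in for whatever escaping a form
handler applies (an assumption; real applications vary). Ordinary markup is
neutralized, while markup spelled with tag characters passes through
untouched.

```python
import html

TAG_LT = "\U000E003C"  # TAG LESS-THAN SIGN
TAG_B = "\U000E0062"   # TAG LATIN SMALL LETTER B
TAG_GT = "\U000E003E"  # TAG GREATER-THAN SIGN

visible = "<b>bold</b> & more"
hidden = TAG_LT + TAG_B + TAG_GT + "bold"

# Ordinary "<", ">" and "&" are replaced by character entities.
escaped_visible = html.escape(visible)

# Tag characters are not among the characters html.escape rewrites.
escaped_hidden = html.escape(hidden)
```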

Now let's suppose that your convention starts being decoded and used in
some applications: this could be used to transport sensitive active scripts
(e.g. JavaScript event handlers or plain <script> elements). This adds an
extra layer of security now needed in these applications, plus updates to
security tools/antivirus scanners.
I bet in fact that all tag characters are most often restricted in text
input forms, and will be silently discarded or the whole text will be
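
A filter of that kind is easy to write. The following Python sketch shows one
possible way an input form could detect or discard everything in the Tags
block (U+E0000..U+E007F); the function names are hypothetical and this is not
taken from any particular application.

```python
import re

# Any code point in the Unicode Tags block, U+E0000..U+E007F.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")


def contains_tag_chars(text: str) -> bool:
    """Return True if the text carries any tag character."""
    return TAG_CHARS.search(text) is not None


def strip_tag_chars(text: str) -> str:
    """Silently discard all tag characters, keeping the visible text."""
    return TAG_CHARS.sub("", text)


# Visible text followed by an invisible run of tag characters.
sample = "hello" + "\U000E003C\U000E0069\U000E003E" + " world"
cleaned = strip_tag_chars(sample)
```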

For me the tag characters are just a quirk for trying to embed in text some
higher-level protocol which is actually not part of the text but only
metadata, including for use with existing language tags (in HTML/SVG we can
already use lang="..." or xml:lang="..." for that purpose, in MIME and
HTTP(S) we can already use the "Content-Language:" and "Accept-Language:"
headers).
We were told that these tag characters were deprecated, and in fact even
their use for language tags has not found any significant use except some
trials (there are now better technologies available in a lot of
software, APIs and services, application design/development tools, and
document editing/publishing tools).

On Sun, 27 Jan 2019 at 21:10, James Kass via Unicode <unicode at unicode.org>
wrote:

> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text
> using the tag characters method described earlier in this thread.  This
> enhancement is described in the release notes linked on this download page:
> http://www.babelstone.co.uk/Software/index.html