Plain text (from Re: Avoidance variants)

Ken Whistler kenwhistler at
Fri Mar 27 13:32:30 CDT 2015

On 3/27/2015 8:15 AM, William_J_G Overington wrote:
>> Or you could just redefine "&" and "<" as
> ----
> That encapsulates what I do not like about using markup other than in very precise limited circumstances such as designing a web page.
> The characters have defined meanings in Unicode: HTML changes those meanings for the purpose of writing web page source code.

This represents a fundamental misunderstanding of what Unicode character
encoding is all about. I realize that William is unlikely to be deterred
from his project to incorporate various functions into what he conceives
of as plain text, but in hopes of preventing other folk from following him
down the garden path, let's consider this more closely.

The Unicode Standard specifies the character encoding for:

003C < LESS-THAN SIGN
003E > GREATER-THAN SIGN

That specification clearly *identifies* the characters and their code
points. The code charts give the representative glyphs, to help in the
identification. And the Unicode Character Database provides precise
specification of character properties for these characters (as for all
others), to assist in uniform and correct implementation.
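
To make "identification plus properties" concrete, here is a minimal sketch
that reads a few Unicode Character Database properties through Python's
standard `unicodedata` module. The helper name `describe` is made up for
illustration, and the choice of properties shown is an assumption about
what is useful to look at, not anything the Standard prescribes.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, general category, mirrored flag) for one character."""
    return (
        f"U+{ord(ch):04X}",              # code point in the usual U+XXXX notation
        unicodedata.name(ch),            # formal character name
        unicodedata.category(ch),        # General_Category, e.g. "Sm" for math symbol
        bool(unicodedata.mirrored(ch)),  # Bidi_Mirrored property
    )

for ch in "<>":
    print(describe(ch))
```

Nothing in what comes back mentions tags, quoting, or operators. The
database identifies the character and describes its properties, and that
is all.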

What the Unicode Standard does *not* do is define the *meanings* of these
characters, in the sense of their meaning in use. That is entirely up to
the people who use them, and more particularly, to people or agencies or
committees or whoever decides to apply such characters in particular
orthographies, formal syntax definitions, or other conventions. Some
examples:


1. if a < b and c > 0 then ac < bc

Here we have a simple algebraic expression, with ">" meaning 'is greater
than' and "<" meaning 'is less than'. Talk to the mathematicians for exact
meaning and usage.

2. <i>a</i>

Here we have the "<" and ">" being used as start and end markers of tags
in a markup scheme for text. Furthermore, the entire strings "<i>" and
"</i>" have further defined meaning as start and end of italic style runs.
Talk to W3C for exact meaning and usage.

3. ==> look here <==

Here we have a common ASCII plain text convention for use of "<" and ">"
as arrowheads for constructed arrows. Talk to... well, whoever writes
plain text email these days for exact meaning and usage.

4. Following is some quoted plain text email:

 > -R
 >> -- Ken
 >> On Dec 7, 2011, at 6:41 AM, Richard COOK wrote:
 >>> On Dec 6, 2011, at 12:19 PM, Ken Lunde wrote:
 >>>> Richard,

Here we have another common ASCII plain text convention for use of ">" --
but this time it indicates both quotation and indentation. Repetition
of use of the ">" indicates repeated re-quotation and further indentation.
Talk to the implementers of plain text email clients for exact meaning 
and usage.
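
As a rough illustration of how such a convention lives in software rather
than in the character itself, here is a small sketch that computes the
quotation depth of a line. The function name `quote_depth` and the rule of
skipping spaces between markers are assumptions made for this example.
Real email clients differ in the details.

```python
def quote_depth(line):
    """Count leading '>' markers, skipping any spaces before or between them."""
    depth = 0
    for ch in line:
        if ch == ">":
            depth += 1
        elif ch != " ":
            break  # first character that is neither marker nor space ends the prefix
    return depth

for line in [" > -R", " >> -- Ken", " >>>> Richard,", "if a < b and c > 0"]:
    print(quote_depth(line), repr(line))
```

The last line contains the same 003E, but not in a position the quoting
convention cares about, so its depth is zero.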

5. cout << "hello!" ;

Here we have an instance from C++ program text, where two "<" in
sequence represent a streaming operator. Talk to the documenters
of the C++ standard for exact meaning and usage.

6. template<class T>

Here we have a *different* instance from C++ program text, which
looks a little like HTML tags, but is not. In this case we are using
"<" and ">" again as paired delimiters ("angle brackets"), but the
syntax and interpretation is distinct. This is not a "tag". Talk to
the documenters of the C++ standard for exact meaning and usage.

7. <somebody at>

This is a convention used in email and other contexts, where 003C
and 003E used as paired delimiters (angle brackets) mark off an
email address or a URL. This might look like the HTML usage, but
it isn't. This isn't a tag. Talk to the implementers of email clients
and similar software for exact meaning and usage.

8. Jean a dit : << Je veux le faire. >>

Oops, here we have something different again. This is a *substitution*
use of 003C and 003E to emulate proper French guillemet punctuation
marks. Poor guy doesn't have guillemets on his keyboard -- what is
he gonna do?!
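
For what it is worth, undoing that substitution is a job for a higher-level
process that knows it is looking at French prose. A deliberately naive
sketch follows. The function name `restore_guillemets` is hypothetical,
and the blanket replacement is an assumption that only holds when the
input really is text of that kind.

```python
def restore_guillemets(text):
    """Replace doubled 003C / 003E with U+00AB and U+00BB guillemets."""
    return text.replace("<<", "\u00ab").replace(">>", "\u00bb")

print(restore_guillemets("Jean a dit : << Je veux le faire. >>"))
# The same function applied blindly to C++ source damages it:
print(restore_guillemets('cout << "hello!" ;'))
```

The second call shows why the interpretation cannot be attached to the
code points themselves: the identical sequence of two 003C characters is
a quotation mark stand-in in one text and a streaming operator in another.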

I'm sure people could come up with many other examples in this vein.
The point of the long-winded exemplification is that characters
"mean" what people use them to "mean". As long as the *identity*
of the character is not in question and the code points are correctly
used and transmitted, then the plain text conformance requirements
of the Unicode Standard have been met.
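
A small sketch of what "correctly used and transmitted" amounts to in
practice: the code points that arrive are the code points that were sent,
whatever convention the text happens to follow. The function name
`transmitted_intact` and the choice of UTF-8 as the encoding form are
assumptions for the example.

```python
def transmitted_intact(text, encoding="utf-8"):
    """Encode and decode text, then compare code point sequences."""
    received = text.encode(encoding).decode(encoding)
    return [ord(c) for c in received] == [ord(c) for c in text]

examples = [
    "if a < b and c > 0 then ac < bc",       # algebra
    "<i>a</i>",                              # markup
    "==> look here <==",                     # constructed arrows
    'cout << "hello!" ;',                    # C++ streaming operator
    "template<class T>",                     # C++ template syntax
    "Jean a dit : << Je veux le faire. >>",  # guillemet substitution
]

for text in examples:
    print(transmitted_intact(text), repr(text))
```

The check never asks what "<" or ">" is being used for, and it does not
need to.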

And this is precisely as it should be. Just as it is not the business
of the Unicode Standard to dictate to anyone how they should spell
text, it is also not the business of the Unicode Standard to limit
or otherwise constrain what conventions of interpretation and/or
what additional layers of syntactic complexity (whether mathematics,
markup, or anything else) people build on top of text characters.

> That use should not act as an Aunt Sally argument for stopping the addition of additional Unicode characters into regular Unicode.
> Adding some additional characters for producing italics, bold and maybe colour as well into regular Unicode, so that the facilities available for use in plain text format are extended, would, in my opinion, be a good thing.

It would be a bad thing. As Asmus has noted in this thread, proposals of
this ilk are dead on arrival at the UTC, because they do not understand
the appropriate layering of text processing. Just because a distinction
is *in* text, does not mean that it should be, ipso facto, defined in plain
text or encoded in characters.

