Concise term for non-ASCII Unicode characters

Sean Leonard lists+unicode at seantek.com
Tue Sep 22 05:18:46 CDT 2015


On 9/22/2015 1:45 AM, Philippe Verdy wrote:
> I would not use the "clumsy 7-bit ASCII" due to the confusion created 
> since long when it could refer to any national version of ISO 646, 
> which reassign some code positions in the range 0x00 to 0x7F to other 
> characters outside the range U+0000 to U+007F, while still remaining 
> 7-bit encodings.
> So instead of "7-bit ASCII" I highly prefer the term "US-ASCII" to make 
> sure it refers to the encoding of 7-bit code positions effectively to 
> U+0000..U+007F.
>
> So for code positions outside 0x00..0x7F, I would call them "not 
> US-ASCII" (none of them are bound to any Unicode "character" or "code 
> point" or "scalar value", they are just "code positions" or more 
> precisely "octet values with their most significant bit set to 1" 
> which is really long: "not US-ASCII" is fine as a shorter term).

Again having just read through ANSI X3.4-1986 (R1997), I would like to 
clarify some things.

The standard itself is titled:
American National Standard for Information Systems - Coded Character 
Sets - 7-Bit American National Standard Code for Information Interchange 
(7-Bit ASCII)

However, Clause 1.1 states:
This standard specifies a set of 128 characters (control characters and 
graphic characters, such as letters, digits, and symbols) with their 
coded representation. The American National Standard Code for 
Information Interchange may also be identified by the acronym ASCII 
(pronounced ask-ee). To explicitly designate a particular (perhaps 
prior) edition of this standard, the last two digits of the year of 
issue may be appended, as in "ASCII 68" or "ASCII 86".


According to the title, "7-Bit ASCII" is proper. However, according to 
the text, "ASCII" is sufficient. The "7-Bit" part really just emphasizes 
the fact that it is a 7-bit standard. The eighth bit is outside the 
scope of the standard (but see clause 2.1.1). (Incidentally, Clause 1.1 
is not Y2K compliant! Thus you should '86 that part of ASCII 86...hehe)

The term "US-ASCII" (see also RFC 2046 for a lot of discussion) is 
similarly redundant. After all, it is the *American* *National* Standard 
Code for Information Interchange. Even if you remove the term "National" 
(which does not appear in ASCII 68 or ASCII 63), it's still American. 
However, ASCII 68 (partially reprinted in RFC 20: 
<https://tools.ietf.org/html/rfc20>) actually permits "the notation 
ASCII (pronounced as'-key) or USASCII (pronounced you-sas'-key) [...] to 
mean the code prescribed by the latest issue of the standard". That is 
probably the genesis of US-ASCII. I wasn't alive at the time so I don't 
know. My suspicion is that "US-ASCII" was meant to disambiguate ASCII 86 
from ASCII 68 (which is referred to as "ASCII" in RFC 821) without 
referring to the year, and since 68 and 86 are transposed numerals, 
"US-ASCII" eliminates possible mix-ups.


My conclusion here is that "ASCII" is sufficient when talking about the 
range of (code or character) positions 0 - 127, regardless of how they 
are encoded, so long as they logically evaluate to the bit combinations 
of the 7-bit code described in ANSI X3.4-1986.
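
To make that concrete, here is a rough sketch in Python of what "logically 
evaluates to 0 - 127, regardless of encoding" could look like in practice. 
The function names and the sample string are my own illustrations, not 
anything taken from ANSI X3.4-1986 or the Unicode Standard; the point is 
only that the test is applied to the decoded value, not to the octets on 
the wire.

```python
from typing import List


def is_ascii_code_point(cp: int) -> bool:
    """True if cp lies in the 7-bit range 0..127 (U+0000..U+007F)."""
    return 0 <= cp <= 0x7F


def non_ascii_chars(text: str) -> List[str]:
    """Return the characters of text that fall outside 0..127, in order."""
    return [ch for ch in text if not is_ascii_code_point(ord(ch))]


# Hypothetical sample: "naive cafe" with two non-ASCII letters
# (U+00EF and U+00E9).
sample = "na\u00efve caf\u00e9"

# The octets differ between encoding forms, but once decoded the same
# logical positions are classified as ASCII or non-ASCII.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    decoded = sample.encode(encoding).decode(encoding)
    assert non_ascii_chars(decoded) == ["\u00ef", "\u00e9"]
```

Python 3.7 and later also offer str.isascii() for a whole-string check, 
which should agree with the per-character test above.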

"Basic Latin" also works if you want to avoid the historic reference. 
But there are many systems in use that are ASCII-based (including the 
Internet, as RFC 20 is still in force), and the term "ASCII" is peppered 
throughout the Unicode Standard 8.0 with greater frequency than "Basic 
Latin" (which is acknowledged to be a synonym for "ASCII" in Sections 
5.7 and 6.2).

Sean




