Beyond ASCII

Sean Leonard lists+unicode at seantek.com
Wed Sep 30 01:12:25 CDT 2015


On 9/29/2015 11:50 AM, Ken Whistler wrote:
> At any rate, any formal contribution that suggests coming up with 
> terminology for
> the #1 and #2 sets should take these considerations under advisement.

The original premise of this thread was (and is) to find the *most 
concise* term for that range U+0080 - U+10FFFF, regardless of whether 
that range is for characters, code points, scalar values, or coffee cup 
icons ☕️. Preferably, such a concise term would have support in the 
Unicode Standard, or in some other standard. I was not looking for a 
totally new, invented term, but rather a term that has empirical, 
standards-based support.

A full survey of the Unicode Standard 8.0 finds that the term "beyond 
ASCII" has textual support:
p. 1 Introduction: While taking the ASCII character set as its starting 
point, the Unicode Standard goes far
beyond ASCII’s limited ability [...]

p. 37 ASCII Transparency: [UTF-8] maintains transparency for all of the
ASCII code points (0x00..0x7F). That means Unicode code points 
U+0000..U+007F are
[thus] indistinguishable from ASCII itself. [...] Beyond the ASCII
range of Unicode, many [...] scripts are represented by two bytes [in 
UTF-8...]

p. 200 Programming Languages: A limitation of the ISO/ANSI C model is 
its assumption that characters can always be processed in isolation. 
Implementations that choose to go beyond the ISO/ANSI C model may
find it useful to mix widths within their APIs.
{This formulation is not "beyond ASCII", but uses the preposition 
"beyond" in the exact same sense, since ASCII is fixed-width and forms 
an underlying assumption of the ISO/ANSI C model.}

p. 237 Case Mappings: A number of complications to case mappings occur 
once the repertoire of characters is
expanded beyond ASCII.

p. 677 Han / CJK Unified Ideographs Extension B: The ideographs in the 
CJK Unified Ideographs Extension B block represent an additional set of 
42,711 unified ideographs beyond the 27,496 included in The Unicode 
Standard, Version 3.0.
{This formulation uses the preposition "beyond" in the exact same sense, 
namely, a subsequent range that is beyond the original range.}
Ditto for Extension C, Extension D, and Extension E.

Finally, (case) "beyond ASCII" is in the Index at p. 237.
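
For anyone who wants to check the p. 37 passage rather than take it on faith, here is a minimal sketch in Python. The helper names is_beyond_ascii and utf8_len are my own inventions for illustration, not terms from the Standard; the only assumption is that "beyond ASCII" means the range U+0080 - U+10FFFF discussed in this thread.

```python
def is_beyond_ascii(ch: str) -> bool:
    """True if the single code point lies in U+0080..U+10FFFF (hypothetical helper)."""
    return ord(ch) >= 0x80


def utf8_len(ch: str) -> int:
    """Number of bytes the code point occupies in UTF-8 (hypothetical helper)."""
    return len(ch.encode("utf-8"))


# ASCII transparency: every code point U+0000..U+007F encodes in UTF-8
# as one byte whose value equals the code point itself.
ascii_transparent = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)

print(ascii_transparent)           # True
print(is_beyond_ascii("?"))        # False: U+003F is inside the ASCII range
print(is_beyond_ascii("\u00e9"))   # True: U+00E9 is beyond ASCII
print(utf8_len("\u00e9"))          # 2 bytes in UTF-8
print(utf8_len("\u2615"))          # 3 bytes in UTF-8 (the coffee cup, U+2615)
```

The test is a single comparison against 0x80, which is about as concise as the term itself.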


Perhaps this thread would have gone differently if the original subject 
was "Beyond ASCII" instead of...that other one.

Now, I am not saying that the term *must* be "beyond ASCII". However, the 
term "non-ASCII" (with or without "Unicode") has no support in the 
Unicode Standard 8.0. The only occurrence is the reference to RFC 2047, 
and in that document, "non-ASCII" clearly means any and every character 
encoding ever invented, not specifically Unicode.


Another thing is the oxymoron "ASCII Unicode" (the opposite of 
"non-ASCII Unicode"). Actually ASCII is a formal subset of Unicode...at 
the beginning. ASCII itself (ANSI X3.4-1986) is a 7-bit character set; 
it does not limit itself to any particular word length so long as the 7 
bits are in those combinations. Therefore U+0000 - U+007F characters 
encoded in UTF-32 or UTF-16 are in ASCII codes; they are truly ASCII 
characters. When a bit combination '?' (0x3F) is loaded into a 64-bit 
register on a CPU, is it still an ASCII character? My view is yes.

Those UTF-16 and UTF-32 forms are not in ASCII *encoding*, as *encoding* is limited to a sequence 
of 7-bit or 8-bit combinations (X3.4-1986 Section 2.1.1(1)). My point 
here is that to be correct, one ought to use some sort of preposition, 
namely "ASCII in Unicode" or "ASCII [characters/code points/scalar 
values] in Unicode"--but if you slice off "in Unicode", you are left 
with "ASCII" and that is just fine. This is another basis for the 
proposition that "beyond ASCII" (e.g., "characters beyond ASCII [in 
Unicode]", "beyond the ASCII range [of Unicode]") makes sense.
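
To illustrate the point about word length, here is a rough sketch in Python. The function code_units is a hypothetical helper of mine, and the choice of little-endian forms without a byte order mark is an assumption made only to keep the output simple. It shows that '?' carries the same value 0x3F whether the code unit is 8, 16, or 32 bits wide, while a character beyond ASCII does not fit in 7 bits.

```python
import struct


def code_units(text: str, encoding: str, fmt: str) -> list:
    """Split the encoded bytes into integer code units (hypothetical helper).

    fmt is a struct format for one code unit: "<B" (8-bit),
    "<H" (16-bit little-endian) or "<I" (32-bit little-endian).
    """
    data = text.encode(encoding)
    size = struct.calcsize(fmt)
    return [
        struct.unpack(fmt, data[i:i + size])[0]
        for i in range(0, len(data), size)
    ]


# The same 7-bit combination appears in every encoding form.
q_utf8 = code_units("?", "utf-8", "<B")
q_utf16 = code_units("?", "utf-16-le", "<H")
q_utf32 = code_units("?", "utf-32-le", "<I")
print(q_utf8, q_utf16, q_utf32)    # [63] [63] [63], i.e. 0x3F each time

# A character beyond ASCII (U+00E9) needs a value above 0x7F.
e_utf16 = code_units("\u00e9", "utf-16-le", "<H")
print(e_utf16)                     # [233], i.e. 0xE9
```

The byte sequences differ between the encoding forms, but the value held in each code unit for an ASCII character is the ASCII code.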

Regards,

Sean

