Concise term for non-ASCII Unicode characters

Sean Leonard lists+unicode at seantek.com
Tue Sep 29 11:20:50 CDT 2015


On 9/21/2015 5:17 PM, Peter Constable wrote:
> If you think it's a serious problem that there isn't one conventional 
> term for "characters outside the ASCII repertoire" or "UTF-8 
> multi-code-unit encoded representations" (since different authors 
> could devise different terminology solutions), then I suggest you 
> submit a document to UTC explaining why it's a problem, documenting 
> inconsistent or unclear terminology that's been used in some standards 
> / public specifications, and requesting that Unicode formally define 
> terminology for these concepts. I can't guarantee that UTC will do it, 
> but I can predict with confidence that it _won't_ do anything of that 
> nature if nobody submits such a document. Peter 

I am of the mind to do just that, then. I have seen different documents, 
standards, and standards bodies that have invented terminology for 
this concept, and the terms are not always the same. Since these 
standards depend on Unicode, it would make a lot of sense for Unicode to 
formally define terminology for these concepts. With the proliferation 
of UTF-8 (among other things), the boundary between 0x7F and 0x80 is 
more significant than the boundary between 0xFFFF and 0x10000.
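
To illustrate why the 0x7F / 0x80 boundary matters in UTF-8, here is a 
minimal sketch in Python. The helper names utf8_len and is_beyond_ascii 
are made up for this example; they are not terms or APIs from the 
Standard. It only shows that code points up to U+007F take one UTF-8 
code unit each and everything above takes two or more.

```python
def utf8_len(cp: int) -> int:
    """Return the number of UTF-8 code units (bytes) for one code point."""
    return len(chr(cp).encode("utf-8"))


def is_beyond_ascii(text: str) -> bool:
    """True if any code point in text lies above U+007F (hypothetical helper)."""
    return any(ord(ch) > 0x7F for ch in text)


# Code points on either side of the two boundaries discussed above.
for cp in (0x7F, 0x80, 0xFFFF, 0x10000):
    print(f"U+{cp:04X}: {utf8_len(cp)} UTF-8 code unit(s)")
```

The step from one code unit to several happens only at 0x80, which is 
the point where ASCII-only software first sees bytes it may not expect.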

Since this will be my first submission I would appreciate a co-author on 
this topic. Is anyone willing to help? Thanks in advance. Also, it is 
not clear if such a document is destined to become a Unicode Technical 
Report (UTR / PDUTR etc.), or if it should just be an informal write-up. 
I am guessing this is supposed to be somewhat informal but at the same 
time it (or the results of it) ought to appear in the UTC Document Search.

The current terminology that I am considering pursuing is "beyond 
ASCII", in various permutations, such as "beyond the ASCII range", 
"characters beyond ASCII", "code points beyond ASCII", etc. The term 
"beyond" implies a certain directionality, and to that extent, implies 
the Unicode repertoire as well as a Unicode encoding. We have seen on 
this list the backflips required to clarify "non-ASCII", since things 
that are literally not ASCII could be a wide range of things.

I think there is some confusion about whether the term "Basic Latin" 
excludes the C0 control character range. Formally the Standard seems 
clear enough to me that it is coterminous with ASCII, but there is still 
confusion if you don't pore over the Standard. My thought is that 
maybe the Blocks.txt data should be modified to say "ASCII (Basic 
Latin)" instead of just "Basic Latin". (If we "go there", I would 
appreciate the wisdom of an experienced Unicode co-author. I am not 
confident touching that just by myself.)
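
As a rough check of the point about C0 controls, the sketch below 
parses the Basic Latin entry as it appears in Blocks.txt 
("0000..007F; Basic Latin") and confirms that the C0 range and DEL fall 
inside it. The function name parse_block is an assumption for this 
example, and the line is hard-coded rather than read from the data file.

```python
# The Basic Latin entry, in the "start..end; name" form used by Blocks.txt.
BLOCK_LINE = "0000..007F; Basic Latin"


def parse_block(line: str) -> tuple[int, int, str]:
    """Split one 'start..end; name' line into (start, end, name)."""
    code_range, name = line.split(";")
    start, end = (int(part, 16) for part in code_range.strip().split(".."))
    return start, end, name.strip()


start, end, name = parse_block(BLOCK_LINE)

# C0 controls are U+0000..U+001F; DEL is U+007F.
c0_in_block = all(start <= cp <= end for cp in range(0x00, 0x20))
del_in_block = start <= 0x7F <= end

print(f"{name}: U+{start:04X}..U+{end:04X}")
print("C0 controls inside block:", c0_in_block)
print("DEL inside block:", del_in_block)
```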

Sean

