Concise term for non-ASCII Unicode characters

Peter Constable petercon at microsoft.com
Sun Sep 20 14:24:14 CDT 2015


Well, if the point is to refer to characters that would require two or more code units in UTF-8, then _accurate_ expressions would be "Unicode characters beyond the Basic Latin block" or "Unicode characters above U+007F".
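
As a rough illustration of that equivalence (a sketch in Python, not anything taken from the standard; the helper names is_non_ascii and utf8_length are made up for this example), a code point should need two or more UTF-8 code units exactly when it lies above U+007F:

```python
def is_non_ascii(ch: str) -> bool:
    """Return True if the single code point in ch lies above U+007F."""
    return ord(ch) > 0x7F


def utf8_length(ch: str) -> int:
    """Return the number of UTF-8 code units (octets) needed to encode ch."""
    return len(ch.encode("utf-8"))


# Sample code points chosen for illustration: the last ASCII value, the first
# value past it, and one example each of a 2-, 3- and 4-octet encoding.
samples = ["A", "\u007F", "\u0080", "\u00E9", "\u20AC", "\U0001F600"]

for ch in samples:
    # The two tests agree for every sample: above U+007F <=> more than one octet.
    assert is_non_ascii(ch) == (utf8_length(ch) > 1)
    print(f"U+{ord(ch):04X}  non-ASCII={is_non_ascii(ch)}  UTF-8 octets={utf8_length(ch)}")
```

Surrogate code points (U+D800 - U+DFFF) are left out on purpose, since they are not scalar values and Python refuses to encode them as UTF-8.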


Peter 

-----Original Message-----
From: Steve Swales [mailto:steve at swales.us] 
Sent: Sunday, September 20, 2015 11:00 AM
To: Phillips, Addison <addison at lab126.com>
Cc: Peter Constable <petercon at microsoft.com>; Sean Leonard <lists+unicode at seantek.com>; unicode at unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters

Exactly. I think the reason that non-ASCII feels non-concise is that there is widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is widely confused with Windows-1252).

-steve  






> On Sep 20, 2015, at 10:05 AM, Phillips, Addison <addison at lab126.com> wrote:
> 
> I agree, although I note that sometimes the additional (redundant) specificity of "non-7-bit-ASCII characters" is needed when talking to people unclear on what "ASCII" means.
> 
> Addison
> 
>> -----Original Message-----
>> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable
>> Sent: Sunday, September 20, 2015 9:52 AM
>> To: Sean Leonard; unicode at unicode.org
>> Subject: RE: Concise term for non-ASCII Unicode characters
>> 
>> You already have been using "non-ASCII Unicode", which is about as 
>> concise and sufficiently accurate as you'll get. There's no term 
>> specifically defined in any standard or conventionally used for this.
>> 
>> 
>> Peter
>> 
>> -----Original Message-----
>> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Sean Leonard
>> Sent: Sunday, September 20, 2015 7:48 AM
>> To: unicode at unicode.org
>> Subject: Concise term for non-ASCII Unicode characters
>> 
>> What is the most concise term for characters or code points outside 
>> of the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to 
>> these as "extended characters" or "non-ASCII Unicode" but I do not 
>> find those terms precise. We are talking about the code points U+0080 
>> - U+10FFFF. I suppose that this also refers to code points/scalar 
>> values that are not formally Unicode characters, such as U+FFFF. 
>> Basically, I am looking for a concise term for values that would 
>> require multiple UTF-8 octets if encoded in UTF-8 (without referring 
>> to UTF-8 encoding specifically). "Non-ASCII" is not precise enough 
>> since character sets like Shift-JIS are non-ASCII.
>> 
>> Also a citation to a relevant standard (whether Unicode or otherwise) 
>> would be helpful.
>> 
>> The terms "supplementary character" and "supplementary code point" 
>> are defined in the Unicode standard, referring to characters or code 
>> points above U+FFFF. I am looking for something like those, but for 
>> characters or code points above U+007F.
>> 
>> Thank you,
>> 
>> Sean
> 
> 


