Concise term for non-ASCII Unicode characters
Martin J. Dürst
duerst at it.aoyama.ac.jp
Sun Sep 20 19:51:32 CDT 2015
On 2015/09/20 23:48, Sean Leonard wrote:
> What is the most concise term for characters or code points
So we already have two different things we might need a term for.
> outside of
> the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these
> as "extended characters"
Most of the characters outside the US-ASCII range are perfectly simple
and basic characters. I don't think the term 'extended' fits well here.
It gives the impression that everything except US-ASCII is somewhat
extraordinary, which in this day and age shouldn't be the case anymore.
> or "non-ASCII Unicode" but I do not find those
> terms precise. We are talking about the code points U+0080 - U+10FFFF. I
> suppose that this also refers to code points/scalar values that are not
> formally Unicode characters, such as U+FFFF.
Again we may need different terms depending on whether these are
included or not.
> Basically, I am looking for
> a concise term for values that would require multiple UTF-8 octets if
> encoded in UTF-8 (without referring to UTF-8 encoding specifically).
> "Non-ASCII" is not precise enough since character sets like Shift-JIS
> are non-ASCII.
Well, the non-ASCII characters in Shift-JIS are also contained in
Unicode, so depending on exactly what you want to talk about, the plain
term "non-ASCII characters" may be good enough.
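
To make the "multiple UTF-8 octets" criterion concrete, here is a minimal
Python sketch. The function names are made up for illustration and do not
come from any standard. It assumes Unicode scalar values only, since
surrogate code points have no UTF-8 encoding, and it also checks one
Shift-JIS character to show that it decodes to an ordinary Unicode code
point.

```python
def utf8_length(cp: int) -> int:
    """Return how many octets UTF-8 uses for the scalar value cp."""
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are code points but not scalar values.
        raise ValueError("surrogate code points have no UTF-8 encoding")
    return len(chr(cp).encode("utf-8"))


def is_non_ascii(cp: int) -> bool:
    """True for code points in the range U+0080 - U+10FFFF."""
    return 0x80 <= cp <= 0x10FFFF


def shift_jis_to_code_points(data: bytes) -> list[int]:
    """Decode Shift-JIS bytes and return the Unicode code points."""
    return [ord(ch) for ch in data.decode("shift_jis")]


if __name__ == "__main__":
    # Boundary values of the UTF-8 length classes (surrogates skipped).
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(f"U+{cp:04X}: {utf8_length(cp)} octet(s), "
              f"non-ASCII={is_non_ascii(cp)}")
    # HIRAGANA LETTER A is 0x82 0xA0 in Shift-JIS and U+3042 in Unicode.
    print([f"U+{cp:04X}" for cp in shift_jis_to_code_points(b"\x82\xa0")])
```

For every scalar value, "needs more than one UTF-8 octet" and "is at or
above U+0080" come out the same, which is why the range can be named
without mentioning UTF-8 at all.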
> Also a citation to a relevant standard (whether Unicode or otherwise)
> would be helpful.
> The terms "supplementary character" and "supplementary code point" are
> defined in the Unicode standard, referring to characters or code points
> above U+FFFF. I am looking for something like those, but for characters
> or code points above U+007F.
And then in some cases, you may want to exclude the C0 area
(U+0000-U+001F), or part of it, or some syntactically significant
characters (e.g. punctuation) in the remaining part.
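
As a rough illustration of how many distinct ranges are in play, the
following Python sketch labels a code point with each category mentioned in
this thread. The function names and label strings are hypothetical, chosen
only for this example. The ranges themselves are the ones given above, plus
the Unicode definitions of surrogates (U+D800-U+DFFF) and noncharacters
(U+FDD0-U+FDEF and the last two code points of each plane).

```python
def is_c0_control(cp: int) -> bool:
    return 0x00 <= cp <= 0x1F


def is_non_ascii(cp: int) -> bool:
    return 0x80 <= cp <= 0x10FFFF


def is_supplementary(cp: int) -> bool:
    return 0x10000 <= cp <= 0x10FFFF


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane n.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe(cp: int) -> list[str]:
    """Return every label (hypothetical names) that applies to cp."""
    checks = [
        ("C0 control", is_c0_control),
        ("non-ASCII", is_non_ascii),
        ("supplementary", is_supplementary),
        ("surrogate", is_surrogate),
        ("noncharacter", is_noncharacter),
    ]
    return [label for label, test in checks if test(cp)]


if __name__ == "__main__":
    for cp in (0x0A, 0x41, 0xE9, 0xD800, 0xFFFF, 0x1F600, 0x10FFFF):
        print(f"U+{cp:04X}: {describe(cp)}")
```

Whether a given term should include surrogates, noncharacters or controls
is exactly the kind of choice that changes from one specification to the
next, so each combination would need its own predicate (and its own name).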
Anyway, what I wanted to show is that depending on what you need it for,
there are so many different variations that it doesn't pay off to create
specific short terms for all of them, and the term you use currently may
be short enough.