Concise term for non-ASCII Unicode characters

Tue Sep 29 12:30:59 CDT 2015

On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
> I would say there's already enough terminology in the Unicode world to add more to it. This thread already hinted at enough ways of expressing what you'd like, the simplest one being "scalar values greater than U+001F". This is the clearest you can come up with and anybody who has basic knowledge of the Unicode standard
Uh...I think you mean U+007F? :)

Perhaps it's because I'm writing to the Unicode crowd, but honestly 
there are a lot of very intelligent software engineers/standards folks 
who do not have the "basic knowledge of the Unicode standard" that is 
being presumed. They want to focus on other parts of their systems or 
protocols, and when it comes to the "text part", they just hand-wave and 
say "Unicode!" and call it a day. In particular there is a flow-down 
effect where terms from one standards body don't match with another 
standards body, perhaps because they got redefined over time for various 
reasons. The distinction between "characters", "abstract characters", 
"code points", and "scalar values" is not intuitively obvious to people 
without specialized knowledge of text processing issues. For example, 
the fact that (modern implementations of) UTF-8 encoders and decoders 
are not supposed to process surrogate code points is a rather advanced 
topic: it presumes knowledge of how UTF-16 and UTF-8 interact, what 
surrogate code points actually are, and the security implications of 
processing them (UTR #36). Furthermore, one has to parse the 
distinction between "well-formed" and "ill-formed".
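
To make the distinction concrete, here is a minimal Python sketch of 
"scalar values greater than U+007F". The function names are my own, 
purely for illustration, and are not terms from any standard. It also 
shows that Python's strict UTF-8 encoder refuses a lone surrogate code 
point, which is the behavior alluded to above.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value.

    Scalar values are all code points (U+0000..U+10FFFF) except the
    surrogate code points (U+D800..U+DFFF).
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_non_ascii_scalar(cp: int) -> bool:
    """True if cp is a scalar value greater than U+007F."""
    return cp > 0x7F and is_scalar_value(cp)


def utf8_encodes(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for surrogate code points under the default "strict" handler.
        return False
```

Under these definitions U+00FC (the "ü" in Bünzli) qualifies, U+007F 
does not, and U+D800 is a code point but not a scalar value, so it is 
excluded even though it is numerically greater than U+007F.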

In the twenty minutes since my last post, I got two different 
responses...and as you pointed out, there are a lot of ways to express 
what one would like. I would prefer one uniform way (hence, 
"standardized way"). Just surveying the various standards that have 
tried to tackle this distinction with their own organic terminology will 
probably be revealing. Evidence should be the yardstick.

Best regards,

Sean