Concise term for non-ASCII Unicode characters
kenwhistler at att.net
Tue Sep 29 13:50:40 CDT 2015
On 9/29/2015 10:30 AM, Sean Leonard wrote:
> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>> I would say there's already enough terminology in the Unicode world
>> to add more to it. This thread already hinted at enough ways of
>> expressing what you'd like, the simplest one being "scalar values
>> greater than U+001F". This is the clearest you can come up with and
>> anybody who has basic knowledge of the Unicode standard
> Uh...I think you mean U+007F? :)
I agree that "scalar values greater than U+007F" doesn't just trip off
the tongue, and while technically accurate, it is bad terminology --
precisely because it begs the question "wtf are 'scalar values'?!" for
the average engineer.
> Perhaps it's because I'm writing to the Unicode crowd, but honestly
> there are a lot of very intelligent software engineers/standards folks
> who do not have the "basic knowledge of the Unicode standard" that is
> being presumed. They want to focus on other parts of their systems or
> protocols, and when it comes to the "text part", they just hand-wave
> and say "Unicode!" and call it a day. ...
Well, from this discussion, and from my experience as an engineer, I
think this comes down to people in other standards, practices, and
protocols dealing with the age-old problem of on beyond zebra for
characters, where the comfortable assumptions break down and people
have to special-case their code and documentation.
overrun, where black hat hackers rub their hands in glee, and where
engineers exclaim, "Oh gawd! I
can't just cast this character, because it's actually an array!"
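
To make the "it's actually an array" complaint concrete, here is a
minimal Python sketch. The helper name truncate_utf8 and the sample
string are made up for illustration and assume well-formed UTF-8 input.
It shows a naive byte-level cut splitting a character, and one possible
way to back up to a character boundary instead.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut UTF-8 bytes to at most `limit` bytes without splitting a character.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    if limit >= len(data):
        return data
    end = max(limit, 0)
    # Continuation bytes have the form 0b10xxxxxx. Back up until the cut
    # lands on the first byte of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


text = "Bünzli"
encoded = text.encode("utf-8")   # 'ü' (U+00FC) occupies two bytes
naive = encoded[:2]              # cuts 'ü' in half

try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False             # the naive cut is not valid UTF-8

safe = truncate_utf8(encoded, 2).decode("utf-8")
```

The safe variant returns fewer bytes than the limit when the limit falls
inside a multi-byte character, which is usually the lesser evil compared
with emitting a broken sequence.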
And nowadays, we are in the age of universal Unicode. All (well, much,
anyway) would be cool
if everybody were using UTF-32, because then at least we'd be back to
and the programming would be easier. But UTF-32 doesn't play well with
and APIs and storage and... So instead, we are in the age of "universal
Unicode and almost
So that leaves us with two types of characters:
1. "Good characters"
These are true ASCII. U+0000..U+007F. Good because they are all single
bytes in UTF-8
and because then UTF-8 strings just work like the Computer Science God
and we don't have to do anything special.
2. "Bad characters"
Everything else: U+0080..U+10FFFF. Bad because they require multiple
bytes to represent
in UTF-8 and so break all the simple assumptions about string and buffer
They make for bugs and more bugs and why oh why do I have to keep
edge cases where character boundaries don't line up with allocated
I think we can agree that there are two types of characters -- and that
those code point ranges correctly identify the sets in question.
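
As a rough illustration of those two ranges, the following Python
sketch classifies a code point and reports how many bytes it takes in
UTF-8. The function names is_ascii and utf8_length are hypothetical,
chosen only for this example. One assumption worth flagging: surrogate
code points (U+D800..U+DFFF) fall inside the second range numerically
but are not scalar values and cannot be encoded in UTF-8, so the sketch
rejects them.

```python
def is_ascii(cp: int) -> bool:
    """True for set #1, U+0000..U+007F."""
    return 0x0000 <= cp <= 0x007F


def utf8_length(cp: int) -> int:
    """Number of bytes a Unicode scalar value occupies in UTF-8."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption of this sketch: surrogates are rejected outright.
        raise ValueError("surrogate code points are not scalar values")
    return len(chr(cp).encode("utf-8"))


for cp in (0x0041, 0x007F, 0x0080, 0x00FC, 0x20AC, 0x1F600):
    kind = "ASCII" if is_ascii(cp) else "non-ASCII"
    print(f"U+{cp:04X}: {kind}, {utf8_length(cp)} byte(s) in UTF-8")
```

Everything in set #1 comes out as one byte; everything in set #2 comes
out as two, three, or four.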
The problem then just becomes a matter of terminology (in the standards
sense of "terminology") -- coming up with usable, clear terms for the
two sets. To be good terminology, the terms have to be identifiable and
neither too generic ("good characters" and "bad characters") nor too
abstruse or wordy ("scalar values less than or equal to U+007F" and
"scalar values greater than U+007F").
They also need to not be confusing. For example, "single-byte UTF-8" and
"multi-byte UTF-8" might work for engineers, but is a confusing
distinction, because UTF-8 as an encoding form is inherently multi-byte,
and such terminology would undermine the meaning of UTF-8.
Finally, to be good terminology, the terms need to have some reasonable
chance of catching on and actually being used. It is fairly pointless to
have a standardized way of distinguishing the #1 and #2 types of
characters if people either don't know about
that standardized way or find it misleading or not helpful, and instead
about with their existing ad hoc terms anyway.
> In the twenty minutes since my last post, I got two different
> responses...and as you pointed out, there are a lot of ways to express
> what one would like. I would prefer one, uniform way (hence,
> "standardized way").
Mark's point was that it is hard to improve on what we already have:
1. ASCII Unicode [characters] (i.e. U+0000..U+007F)
2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)
If we just highlight that terminology more prominently, emphasize it in the
Unicode glossary, and promote it relentlessly, it might catch on more
and solve the problem.
More irreverently, perhaps we could come up with complete neologisms that
might be catchy enough to go viral -- at least among the protocol
engineers who matter for this. Riffing on the small/big distinction and
it to "u-*nichar*" for the engineers, maybe something along the lines of:
Well, maybe not those! But you get the idea. I'm sure there is a budding
out there who could improve on that suggestion!
At any rate, any formal contribution that suggests coming up with terms
for the #1 and #2 sets should take these considerations under advisement.
Unless it suggests something that would pretty easily gain consensus as
demonstrably better than the #1 and #2 terms suggested above by Mark, it
might not result in any change in actual usage.