Concise term for non-ASCII Unicode characters
lists+unicode at seantek.com
Wed Sep 30 00:07:35 CDT 2015
On 9/29/2015 11:50 AM, Ken Whistler wrote:
> On 9/29/2015 10:30 AM, Sean Leonard wrote:
>> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>>> I would say there's already enough terminology in the Unicode world
>>> to add more to it. This thread already hinted at enough ways of
>>> expressing what you'd like, the simplest one being "scalar values
>>> greater than U+001F". This is the clearest you can come up with and
>>> anybody who has basic knowledge of the Unicode standard
>> Uh...I think you mean U+007F? :)
> I agree that "scalar values greater than U+007F" doesn't just trip off the
> tongue, and while technically accurate, it is bad terminology -- precisely
> because it begs the question "wtf are 'scalar values'?!" for the average
> engineer.
>> Perhaps it's because I'm writing to the Unicode crowd, but honestly
>> there are a lot of very intelligent software engineers/standards
>> folks who do not have the "basic knowledge of the Unicode standard"
>> that is being presumed. They want to focus on other parts of their
>> systems or protocols, and when it comes to the "text part", they just
>> hand-wave and say "Unicode!" and call it a day. ...
> Well, from this discussion, and from my experience as an engineer, I think
> this comes down to people in other standards, practices, and protocols
> dealing with the ages old problem of on beyond zebra for characters, where
> the comfortable assumptions that byte=character break down and people have
> to special case their code and documentation. Where buffers overrun, where
> black hat hackers rub their hands in glee, and where engineers exclaim,
> "Oh gawd! I can't just cast this character, because it's actually an array!"
> And nowadays, we are in the age of universal Unicode. All (well, much,
> anyway) would be cool if everybody were using UTF-32, because then at
> least we'd be back to
> and the programming would be easier. But UTF-32 doesn't play well with
> existing protocols and APIs and storage and... So instead, we are in the
> age of "universal Unicode and almost always UTF-8."
> So that leaves us with two types of characters:
> 1. "Good characters"
> These are true ASCII. U+0000..U+007F. Good because they are all single
> bytes in UTF-8 and because then UTF-8 strings just work like the Computer
> Science God always intended, and we don't have to do anything special.
> 2. "Bad characters"
> Everything else: U+0080..U+10FFFF. Bad because they require multiple bytes
> to represent in UTF-8 and so break all the simple assumptions about string
> and buffer length.
> They make for bugs and more bugs and why oh why do I have to keep dealing
> with edge cases where character boundaries don't line up with allocated
> buffer boundaries?!!
> I think we can agree that there are two types of characters -- and that
> those code point ranges correctly identify the sets in question.
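
A minimal sketch of the buffer-boundary problem described above, in Python. The helper names utf8_len and truncate_utf8 are hypothetical, chosen only for illustration, and the truncation helper assumes its input is already well-formed UTF-8. It shows that U+0000..U+007F take one byte each while everything else takes two to four, and one way to cut a byte string to a buffer size without splitting a character.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut well-formed UTF-8 to at most max_bytes without splitting a character.

    Hypothetical helper: assumes `data` is valid UTF-8 and max_bytes >= 0.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte we
    # would drop is a continuation byte, the cut falls inside a character, so
    # back up until the cut lands on a character boundary.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
    encoded = "na\u00efve".encode("utf-8")  # 5 characters, 6 bytes
    print(truncate_utf8(encoded, 3))        # cut would split U+00EF, so back up
```

A naive `encoded[:3]` would leave a dangling lead byte at the end of the buffer; the loop above is one common way to avoid that, not the only one.
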
> The problem then just becomes a matter of terminology (in the standards
> sense of "terminology") -- coming up with usable, clear terms for the two
> sets. To be good terminology, the terms have to be identifiable and neither
> too generic ("good characters" and "bad characters") nor too abstruse or
> wordy ("scalar values less than or equal to U+007F" and "scalar values
> greater than U+007F").
> They also need to not be confusing. For example, "single-byte UTF-8" and
> "multi-byte UTF-8" might work for engineers, but is a confusing
> distinction, because UTF-8 as an encoding form is inherently multi-byte,
> and such terminology would undermine the meaning of UTF-8.
> Finally, to be good terminology, the terms need to have some reasonable
> chance of catching on and actually being used. It is fairly pointless to
> have a "standardized way" of distinguishing the #1 and #2 types of
> characters if people either don't know about that standardized way or find
> it misleading or not helpful, and instead continue groping about with their
> existing ad hoc terms anyway.
>> In the twenty minutes since my last post, I got two different
>> responses...and as you pointed out, there are a lot of ways to
>> express what one would like. I would prefer one, uniform way (hence,
>> "standardized way").
> Mark's point was that it is hard to improve on what we already have:
> 1. ASCII Unicode [characters] (i.e. U+0000..U+007F)
> 2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)
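
As a rough illustration of the two terms just listed, here is a short Python sketch that classifies characters by the quoted code point ranges. The names is_ascii_char, is_non_ascii_char, and split_by_class are hypothetical, not taken from any standard or library. One assumption to flag: a Python str can hold lone surrogates (U+D800..U+DFFF), which are code points but not scalar values; this sketch simply classifies by code point range.

```python
ASCII_MAX = 0x7F         # last code point of set #1 (U+0000..U+007F)
UNICODE_MAX = 0x10FFFF   # last code point of set #2 (U+0080..U+10FFFF)


def is_ascii_char(ch: str) -> bool:
    """True if the single character ch is in U+0000..U+007F."""
    return ord(ch) <= ASCII_MAX


def is_non_ascii_char(ch: str) -> bool:
    """True if the single character ch is in U+0080..U+10FFFF."""
    return ASCII_MAX < ord(ch) <= UNICODE_MAX


def split_by_class(text: str) -> tuple[str, str]:
    """Return (ASCII characters, non-ASCII characters) of text, in order."""
    ascii_part = "".join(c for c in text if is_ascii_char(c))
    non_ascii_part = "".join(c for c in text if is_non_ascii_char(c))
    return ascii_part, non_ascii_part


if __name__ == "__main__":
    print(split_by_class("B\u00fcnzli"))
```

For whole strings, Python 3.7 and later also provide the built-in str.isascii(), which answers the same question without a per-character loop.
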
> If we just highlight that terminology more prominently, emphasize it in the
> Unicode glossary, and promote it relentlessly, it might catch on more and
> solve the problem.
> More irreverently, perhaps we could come up with complete neologisms that
> might be catchy enough to go viral -- at least among the protocol writers
> and engineers who matter for this. Riffing on the small/big distinction and
> connecting it to "u-*nichar*" for the engineers, maybe something along the
> lines of:
> 1. skinnichar
> 2. baloonichar
> Well, maybe not those! But you get the idea. I'm sure there is a budding
> terminologist out there who could improve on that suggestion!
> At any rate, any formal contribution that suggests coming up with
> terminology for the #1 and #2 sets should take these considerations under
> advisement. And unless it suggests something that would pretty easily gain
> consensus as demonstrably better than the #1 and #2 terms suggested above
> by Mark, it might not result in any change in actual usage.
Thank you for this post. Slightly tongue-in-cheek but I think that it
captures the issues at play.