Concise term for non-ASCII Unicode characters

Sean Leonard lists+unicode at seantek.com
Wed Sep 30 00:07:35 CDT 2015


On 9/29/2015 11:50 AM, Ken Whistler wrote:
>
>
> On 9/29/2015 10:30 AM, Sean Leonard wrote:
>> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>>> I would say there's already enough terminology in the Unicode world 
>>> to add more to it. This thread already hinted at enough ways of 
>>> expressing what you'd like, the simplest one being "scalar values 
>>> greater than U+001F". This is the clearest you can come up with and 
>>> anybody who has basic knowledge of the Unicode standard
>> Uh...I think you mean U+007F? :)
>
> I agree that "scalar values greater than U+007F" doesn't just trip off 
> the tongue,
> and while technically accurate, it is bad terminology -- precisely 
> because it
> begs the question "wtf are 'scalar values'?!" for the average engineer.
>
>>
>> Perhaps it's because I'm writing to the Unicode crowd, but honestly 
>> there are a lot of very intelligent software engineers/standards 
>> folks who do not have the "basic knowledge of the Unicode standard" 
>> that is being presumed. They want to focus on other parts of their 
>> systems or protocols, and when it comes to the "text part", they just 
>> hand-wave and say "Unicode!" and call it a day. ...
>
> Well, from this discussion, and from my experience as an engineer, I 
> think this comes down
> to people in other standards, practices, and protocols dealing with 
> the ages old problem
> of on beyond zebra for characters, where the comfortable assumptions 
> that byte=character
> break down and people have to special case their code and 
> documentation. Where buffers
> overrun, where black hat hackers rub their hands in glee, and where 
> engineers exclaim, "Oh gawd! I
> can't just cast this character, because it's actually an array!"
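>
> (A quick concrete illustration -- a sketch in C, assuming the string
> literal below is UTF-8 encoded; the little program is invented for
> this post:
>
>     #include <stdio.h>
>     #include <string.h>
>
>     int main(void) {
>         /* "café" is 4 characters, but U+00E9 takes two UTF-8 bytes. */
>         const char *s = "caf\xC3\xA9";
>         printf("%zu\n", strlen(s));   /* prints 5, not 4 */
>         return 0;
>     }
>
> The byte count and the character count part ways as soon as anything
> past U+007F shows up, which is exactly where the casting and buffer
> assumptions fall over.)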
>
> And nowadays, we are in the age of universal Unicode. All (well, much, 
> anyway) would be cool
> if everybody were using UTF-32, because then at least we'd be back to 
> 32-bit-word=character,
> and the programming would be easier. But UTF-32 doesn't play well with 
> existing protocols
> and APIs and storage and... So instead, we are in the age of 
> "universal Unicode and almost
> always UTF-8."
>
> So that leaves us with two types of characters:
>
> 1. "Good characters"
>
> These are true ASCII. U+0000..U+007F. Good because they are all single 
> bytes in UTF-8
> and because then UTF-8 strings just work like the Computer Science God 
> always intended,
> and we don't have to do anything special.
>
> 2. "Bad characters"
>
> Everything else: U+0080..U+10FFFF. Bad because they require multiple 
> bytes to represent
> in UTF-8 and so break all the simple assumptions about string and 
> buffer length.
> They make for bugs and more bugs and why oh why do I have to keep 
> dealing with
> edge cases where character boundaries don't line up with allocated 
> buffer boundaries?!!
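>
> (For the record, the split maps directly onto how many UTF-8 code
> units a scalar value needs -- a rough sketch in C, with the helper
> name invented here:
>
>     #include <stdint.h>
>
>     /* Bytes needed to encode a Unicode scalar value in UTF-8. */
>     static int utf8_len(uint32_t scalar) {
>         if (scalar <= 0x7F)   return 1;   /* #1: U+0000..U+007F    */
>         if (scalar <= 0x7FF)  return 2;   /* #2: U+0080..U+07FF    */
>         if (scalar <= 0xFFFF) return 3;   /*     U+0800..U+FFFF    */
>         return 4;                         /*     U+10000..U+10FFFF */
>     }
>
> Everything in set #1 returns 1; everything in set #2 returns 2, 3,
> or 4.)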
>
> I think we can agree that there are two types of characters -- and 
> that those code point
> ranges correctly identify the sets in question.
>
> The problem then just becomes a matter of terminology (in the 
> standards sense of
> "terminology") -- coming up with usable, clear terms for the two sets. 
> To be good
> terminology, the terms have to be identifiable and neither too generic 
> ("good characters"
> and "bad characters") or too abstruse or wordy ("scalar values less 
> than or equal to U+007F" and
> "scalar values greater than U+007F").
>
> They also need to not be confusing. For example, "single-byte UTF-8" 
> and "multi-byte UTF-8"
> might work for engineers, but it is a confusing distinction, because 
> UTF-8 as an encoding
> form is inherently multi-byte, and such terminology would undermine 
> the meaning of UTF-8
> itself.
>
> Finally, to be good terminology, the terms need to have some 
> reasonable chance of
> catching on and actually being used. It is fairly pointless to have a 
> "standardized way"
> of distinguishing the #1 and #2 types of characters if people either 
> don't know about
> that standardized way or find it misleading or not helpful, and 
> instead continue groping
> about with their existing ad hoc terms anyway.
>
>>
>> In the twenty minutes since my last post, I got two different 
>> responses...and as you pointed out, there are a lot of ways to 
>> express what one would like. I would prefer one, uniform way (hence, 
>> "standardized way").
>
> Mark's point was that it is hard to improve on what we already have:
>
> 1. ASCII Unicode [characters] (i.e. U+0000..U+007F)
>
> 2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)
>
> If we just highlight that terminology more prominently, emphasize it 
> in the
> Unicode glossary, and promote it relentlessly, it might catch on more 
> generally,
> and solve the problem.
>
> More irreverently, perhaps we could come up with complete neologisms that
> might be catchy enough to go viral -- at least among the protocol 
> writers and
> engineers who matter for this. Riffing on the small/big distinction 
> and connecting
> it to "u-*nichar*" for the engineers, maybe something along the lines of:
>
> 1. skinnichar
>
> 2. balloonichar
>
> Well, maybe not those! But you get the idea. I'm sure there is a 
> budding terminologist
> out there who could improve on that suggestion!
>
> At any rate, any formal contribution that suggests coming up with 
> terminology for
> the #1 and #2 sets should take these considerations under advisement. 
> And unless
> it suggests something that would pretty easily gain consensus as 
> demonstrably better than
> the #1 and #2 terms suggested above by Mark, it might not result in any
> change in actual usage.

Thank you for this post. Slightly tongue-in-cheek, but I think it 
captures the issues at play.

Sean

