Concise term for non-ASCII Unicode characters

Ken Whistler kenwhistler at att.net
Tue Sep 29 13:50:40 CDT 2015



On 9/29/2015 10:30 AM, Sean Leonard wrote:
> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>> I would say there's already enough terminology in the Unicode world 
>> to add more to it. This thread already hinted at enough ways of 
>> expressing what you'd like, the simplest one being "scalar values 
>> greater than U+001F". This is the clearest you can come up with and 
>> anybody who has basic knowledge of the Unicode standard
> Uh...I think you mean U+007F? :)

I agree that "scalar values greater than U+007F" doesn't just trip off 
the tongue,
and while technically accurate, it is bad terminology -- precisely 
because it
begs the question "wtf are 'scalar values'?!" for the average engineer.

>
> Perhaps it's because I'm writing to the Unicode crowd, but honestly 
> there are a lot of very intelligent software engineers/standards folks 
> who do not have the "basic knowledge of the Unicode standard" that is 
> being presumed. They want to focus on other parts of their systems or 
> protocols, and when it comes to the "text part", they just hand-wave 
> and say "Unicode!" and call it a day. ...

Well, from this discussion, and from my experience as an engineer, I think this comes down to people in other standards, practices, and protocols dealing with the age-old problem of on beyond zebra for characters, where the comfortable assumption that byte = character breaks down and people have to special-case their code and documentation. Where buffers overrun, where black hat hackers rub their hands in glee, and where engineers exclaim, "Oh gawd! I can't just cast this character, because it's actually an array!"
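
To make that "it's actually an array" complaint concrete, here is a minimal C sketch (mine, not from this thread) of the byte = character assumption failing:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* One user-visible character, but not one byte in UTF-8:
         * U+00E9 LATIN SMALL LETTER E WITH ACUTE encodes as 0xC3 0xA9. */
        const char *e_acute = "\xC3\xA9";

        /* strlen() counts bytes, not characters: prints 2, not 1. */
        printf("bytes: %zu\n", strlen(e_acute));

        /* "Casting the character" grabs only the lead byte 0xC3 --
         * a fragment of the encoding, not the character itself. */
        printf("first byte: 0x%02X\n", (unsigned char)e_acute[0]);
        return 0;
    }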

And nowadays, we are in the age of universal Unicode. All (well, much, 
anyway) would be cool
if everybody were using UTF-32, because then at least we'd be back to 
32-bit-word=character,
and the programming would be easier. But UTF-32 doesn't play well with 
existing protocols
and APIs and storage and... So instead, we are in the age of "universal 
Unicode and almost
always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII. U+0000..U+007F. Good because they are all single 
bytes in UTF-8
and because then UTF-8 strings just work like the Computer Science God 
always intended,
and we don't have to do anything special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple 
bytes to represent
in UTF-8 and so break all the simple assumptions about string and buffer 
length.
They make for bugs and more bugs and why oh why do I have to keep 
dealing with
edge cases where character boundaries don't line up with allocated 
buffer boundaries?!!

I think we can agree that there are two types of characters -- and that 
those code point
ranges correctly identify the sets in question.
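
In code, the boundary between the two sets is exactly the one-byte/multi-byte boundary of UTF-8. A sketch in C (the function names are mine, purely illustrative):

    #include <stdint.h>

    /* Nonzero for a #1 ("good", ASCII-range) scalar value. */
    static int is_ascii_scalar(uint32_t cp) {
        return cp <= 0x7F;
    }

    /* Bytes needed to encode a scalar value in UTF-8: 1 for the #1
     * set, 2..4 for the #2 set. (Surrogates U+D800..U+DFFF are not
     * scalar values and never appear in well-formed UTF-8.) */
    static int utf8_length(uint32_t cp) {
        if (cp <= 0x7F)     return 1;
        if (cp <= 0x7FF)    return 2;
        if (cp <= 0xFFFF)   return 3;
        if (cp <= 0x10FFFF) return 4;
        return -1;  /* beyond the Unicode code space */
    }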

The problem then just becomes a matter of terminology (in the standards sense of "terminology") -- coming up with usable, clear terms for the two sets. To be good terminology, the terms have to be identifiable and neither too generic ("good characters" and "bad characters") nor too abstruse or wordy ("scalar values less than or equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" and "multi-byte UTF-8" might work for engineers, but the distinction is confusing, because UTF-8 as an encoding form is inherently multi-byte, and such terminology would undermine the meaning of UTF-8 itself.

Finally, to be good terminology, the terms need to have some reasonable
chance of
catching on and actually being used. It is fairly pointless to have a 
"standardized way"
of distinguishing the #1 and #2 types of characters if people either 
don't know about
that standardized way or find it misleading or not helpful, and instead 
continue groping
about with their existing ad hoc terms anyway.

>
> In the twenty minutes since my last post, I got two different 
> responses...and as you pointed out, there are a lot of ways to express 
> what one would like. I would prefer one, uniform way (hence, 
> "standardized way").

Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it in the
Unicode glossary, and promote it relentlessly, it might catch on more 
generally,
and solve the problem.
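
The terminology also maps directly onto the test that protocol code actually writes. A minimal sketch (the helper name is hypothetical):

    #include <stddef.h>

    /* Returns 1 if the buffer contains only ASCII Unicode characters
     * (U+0000..U+007F). Because UTF-8 uses bytes >= 0x80 only in the
     * encodings of non-ASCII Unicode characters, a plain byte scan of
     * UTF-8 text answers the question directly. */
    static int is_ascii_unicode(const unsigned char *s, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (s[i] > 0x7F)
                return 0;
        return 1;
    }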

More irreverently, perhaps we could come up with complete neologisms that
might be catchy enough to go viral -- at least among the protocol 
writers and
engineers who matter for this. Riffing on the small/big distinction and 
connecting
it to "u-*nichar*" for the engineers, maybe something along the lines of:

1. skinnichar

2. balloonichar

Well, maybe not those! But you get the idea. I'm sure there is a budding 
terminologist
out there who could improve on that suggestion!

At any rate, any formal contribution that suggests coming up with 
terminology for
the #1 and #2 sets should take these considerations under advisement. 
And unless
it suggests something that would pretty easily gain consensus as 
demonstrably better than
the #1 and #2 terms suggested above by Mark, it might not result in any
change in actual usage.

--Ken


