Concise term for non-ASCII Unicode characters
Tony at Jollans.com
Mon Sep 21 14:54:23 CDT 2015
Goodness, sorry, no, I didn't mean that at all!!!
What I meant was that a recognised encoding should be used consistently,
regardless of the number of bytes required, and all encodings of Unicode
code points are necessarily potentially multi-byte. Single-byte encodings
may save a little bit of space, and may be Windows-1252, or Windows-1253, or
one of many other encodings, but they are not, in any sense, Unicode encodings.
Windows code pages and their ilk predate Unicode, and I would only ever
expect to see them used in environments where legacy support is needed, and
would not expect a significant amount of new documentation about them to be
written. When it is necessary to describe them, one should do so fully and
properly, naming whichever encoding is actually in use, but they really have no meaning in a
Unicode context. Nor, as far as I'm aware, do the 0x80 to 0xFF octets have
any special meaning in Unicode that would require there to be a recognisable
term to describe them.
Code that processes arbitrary *character* sequences (for legibility or any
other reason) should, surely, work with characters, which may be sequences
of code points, each of which may be a sequence of bytes. I can think of no
reason for chopping up byte sequences except where they are going to be
recombined later by the reverse treatment. Code that does so, if it is
required at all, probably has no idea of meaning, need not have any, and can
only, surely, work with bytes.
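
To illustrate the distinction, here is a minimal sketch in Python. The helper
names chop and rejoin are hypothetical, chosen only for this example. It
assumes the chunks are always recombined before anyone decodes them, so the
chopping code never needs to know the encoding.

```python
def chop(data: bytes, size: int) -> list[bytes]:
    """Split a byte string into chunks of at most `size` bytes, ignoring encoding."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def rejoin(chunks: list[bytes]) -> bytes:
    """Reverse of chop(): concatenate the chunks back into the original bytes."""
    return b"".join(chunks)


text = "naïve café"
encoded = text.encode("utf-8")

# A 3-byte chunk size cuts through the two-byte encoding of "ï",
# so the first chunk is not decodable on its own.
chunks = chop(encoded, 3)

# Decoding is only attempted after the reverse treatment.
restored = rejoin(chunks).decode("utf-8")
```

The individual chunks are just bytes. Only the rejoined sequence is treated as
text.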
The actual octets are, of course, used in combinations, but not singly in
any way that requires them to be described in Unicode terms. Or am I missing
something?
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard
Sent: 21 September 2015 19:18
To: unicode at unicode.org
Subject: Re: Concise term for non-ASCII Unicode characters
On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans" <Tony at jollans.com> wrote:
> These days, it is pretty sloppy coding that cares how many bytes an
> encoding of something requires, although there may be many
> circumstances where legacy support is required.
Wow! Are you saying that code chopping up arbitrary character sequences for
legibility (and editability!) and to avoid buffering issues should generally
assume it will be read as UTF-8, and avoid splitting well-formed UTF-8
characters? (If the text is actually Windows-1252, there may be a lot of
apparently ill-formed UTF-8 characters/gibberish.)
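
A rough sketch of what such splitting might look like, in Python, assuming the
input really is UTF-8. The helper name split_utf8 is hypothetical. It backs up
from each proposed cut point while the byte there is a continuation byte
(10xxxxxx), and falls back to a raw byte cut if the data turns out to be
ill-formed.

```python
def split_utf8(data: bytes, limit: int) -> list[bytes]:
    """Split bytes into chunks of at most `limit` bytes without cutting
    through a well-formed UTF-8 encoded code point."""
    if limit < 4:
        raise ValueError("limit must be at least 4, the longest UTF-8 sequence")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + limit, len(data))
        cut = end
        # Back up while the byte just after the cut is a continuation byte.
        while start < cut < len(data) and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        if cut == start:
            # Ill-formed input (a long run of continuation bytes):
            # fall back to a plain byte cut so the loop always advances.
            cut = end
        chunks.append(data[start:cut])
        start = cut
    return chunks


data = "ab€cd".encode("utf-8")   # "€" is the three bytes E2 82 AC
pieces = split_utf8(data, 4)
```

This keeps encoded code points whole, but it does not keep combining
sequences together, so a character made of several code points can still be
split.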
> You say that, in some
> contexts, one needs to be really clear that the octets 0x80 - 0xFF are
> Unicode. Either something "is" Unicode, or it isn't. Either something
> uses a recognised encoding, or it doesn't. Using these octets to
> represent Unicode code points is not ASCII, is not UTF-8, and is not
> UCS-2/UTF-16; it could, perhaps, be EBCDIC.
But most of these octets *are* used to represent non-ASCII scalar values.
It's just that they have to operate in combinations for UTF-8.
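
As a small sketch of that point in Python (the helper name octet_role is
hypothetical), each octet can be classified by the role it plays in
well-formed UTF-8. The octets 0xC0, 0xC1 and 0xF5 to 0xFF never occur there at
all.

```python
def octet_role(b: int) -> str:
    """Classify a single octet by its role in well-formed UTF-8."""
    if b <= 0x7F:
        return "ascii"            # a complete one-byte sequence
    if b <= 0xBF:
        return "continuation"     # 10xxxxxx, never first in a sequence
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "unused"           # cannot appear in well-formed UTF-8
    if b <= 0xDF:
        return "lead-2"           # starts a two-byte sequence
    if b <= 0xEF:
        return "lead-3"           # starts a three-byte sequence
    return "lead-4"               # 0xF0 to 0xF4, starts a four-byte sequence


# U+00E9 needs two octets in UTF-8, both in the 0x80 to 0xFF range.
utf8_roles = [octet_role(b) for b in "é".encode("utf-8")]

# In Windows-1252 the same character is the single octet 0xE9.
cp1252_bytes = "é".encode("cp1252")
```

Here utf8_roles is ["lead-2", "continuation"], while cp1252_bytes is the single
byte 0xE9, which on its own would be an incomplete lead byte in UTF-8.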