Concise term for non-ASCII Unicode characters

Philippe Verdy verdy_p at wanadoo.fr
Mon Sep 21 16:34:19 CDT 2015


2015-09-21 21:54 GMT+02:00 Tony Jollans <Tony at jollans.com>:

> The actual octets are, of course, used in combinations, but not singly in
> any way that requires them to be described in Unicode terms. Or am I
> missing something fundamental?
>

The terms you are looking for are defined in the part of the standard that
describes the standard Unicode encoding forms and schemes.

If you're speaking at the octet level, the proper term is "8-bit code
unit". Look for the definition of "code units", not "code points",
"scalar values" or "characters".
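To make the distinction concrete, here is a minimal sketch in Python (whose
str type happens to be indexed by code point). The helper name code_units
is hypothetical, chosen only for this illustration; it is not a term or API
from the standard.

```python
def code_units(text, form):
    """Return the code units of `text` in the given encoding form.

    `form` is one of "utf-8", "utf-16" or "utf-32" (hypothetical helper,
    for illustration only).
    """
    widths = {"utf-8": 1, "utf-16": 2, "utf-32": 4}
    if form not in widths:
        raise ValueError("unsupported encoding form: " + form)
    width = widths[form]
    # Use the big-endian codec variants so that no byte order mark is added.
    codec = form if width == 1 else form + "-be"
    data = text.encode(codec)
    # Group the serialized octets back into code units of the right width.
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


text = "\u00e9\U0001F600"  # two code points: U+00E9 and U+1F600

print(len(text))                                      # number of code points
print([hex(u) for u in code_units(text, "utf-8")])    # 8-bit code units
print([hex(u) for u in code_units(text, "utf-16")])   # 16-bit code units
print([hex(u) for u in code_units(text, "utf-32")])   # 32-bit code units
```

The same two code points come out as six 8-bit code units, three 16-bit
code units or two 32-bit code units, which is why the octet-level question
has to be asked per encoding form.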

"Character" has another definition in programming languages, but Unicode is
not normatively bound to any programming language, and the actual storage
or transport size of characters in them is not part of the standard. You'll
need to look into the technical documentation of each programming language,
transport protocol or storage device. This is out of scope of the standard
itself: each environment describes its own API, library or adapter to
interface with Unicode elements and texts or to convert data correctly,
sometimes with several competing interfaces or converters. On this list we
are only focused on standard interchange formats, but the problem has long
been solved, notably by Internet standards and RFCs such as MIME, which
also has its own definition of "characters" because these standards are not
exclusively bound to Unicode but also support other legacy standards.

But even in this case these definitions apply only at an upper layer. The
lower layers may use other conversions, including data compression
techniques or escaping modes, or could even work with units smaller than
octets or even smaller than binary bits, or could multiplex some bits with
some complex state representation, for example in modems working with bits
spread over a matrix of non-binary states with redundancy and automatic
error correction. Even the order of bits is not defined in the Unicode
standard or in the internal lower layers of an interface. These are not the
layers concerned with interchange in a large network: they are specific to
each physical or virtual link between specific pairs of hosts,
buses/cables, hubs, switches, or routers, and at this level they do not
even have to know whether the data actually contains text or which
upper-layer encoding forms are used or implied.

So let's get back to your focus: you're wondering if there's a term for
octets with the high bit set, in the context of texts processed with some
standard Unicode algorithms.
- We have a term for the 16-bit code units used in combinations to encode a
single code point: these are "surrogates".
- For 8-bit code units, there are at least 3 encodings described: UTF-8,
CESU-8 and SCSU. Each one has its own subranges of octet values that are
processed differently. The best way to name these ranges is to look into
the standard documentation of these encoding schemes. These definitions are
independent of those used in other encoding schemes/forms (including those
defined by TUS): they do not operate at the same level, and these
independent levels should (must?) be blackboxed (their scope is strongly
defined and transparent to all other layers of processing, and every layer
is replaceable by another competing encoding).
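
As a rough sketch of what such subranges look like, the following Python
code classifies 16-bit code units as surrogates and 8-bit code units by the
role they can play in well-formed UTF-8. The function names and the
descriptive labels are mine, chosen for illustration, and are not normative
terms. The octet ranges apply to UTF-8 only; CESU-8 and SCSU partition the
octet values differently and should be looked up in their own
specifications.

```python
def classify_utf16_unit(unit):
    """Classify one 16-bit code unit (illustrative labels only)."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "non-surrogate"


def classify_utf8_octet(octet):
    """Classify one 8-bit code unit by its role in well-formed UTF-8."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet")
    if octet <= 0x7F:
        return "single-octet sequence (ASCII)"
    if octet <= 0xBF:
        return "trailing octet"
    if octet <= 0xC1:
        return "never used in well-formed UTF-8"
    if octet <= 0xDF:
        return "lead octet of a 2-octet sequence"
    if octet <= 0xEF:
        return "lead octet of a 3-octet sequence"
    if octet <= 0xF4:
        return "lead octet of a 4-octet sequence"
    return "never used in well-formed UTF-8"


# Every octet of a non-ASCII character in UTF-8 has its high bit set,
# but lead and trailing octets fall into different subranges.
for octet in "\u20ac".encode("utf-8"):  # U+20AC EURO SIGN
    print(f"{octet:#04x}", classify_utf8_octet(octet))
```

So "octet with the high bit set" covers several distinct roles in UTF-8
alone, and none of them means anything outside that encoding.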

Note that initially, even TUS did not define any encoding scheme below the
level of code points and their scalar values. There was then no concept of
"code units"; these were standardized only because a few encoding schemes
(UTFs) were integrated in a standard annex, then directly in TUS itself as
they became ubiquitous for handling Unicode texts and outweighed all
other (older) legacy standards (including Internet standards, which still
survive with their mandatory or optional support of legacy standards: UTF-8
proved to be the easiest encoding working with a basic level of
compatibility with these older standards).