Surrogates and noncharacters

Tue May 12 09:50:02 CDT 2015

2015-05-12 15:56 GMT+02:00 Hans Aberg <haberg-1 at telia.com>:

>
> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
>

Not the main reason: communication protocols and data storage are also
based on 8-bit code units (even if storage groups them into much larger
blocks).

UTF-8 is the default choice for all Internet protocols because all these
protocols are based on these units.

This last remark is true except at lower levels, on the link interfaces and
on physical links, where the unit is the bit or sometimes even a fraction
of a bit. There, bits are grouped into frames that transport not only data
bits but also specific items needed by the physical constraints, such as:
maintaining the mean polarity, restricting the frequency bandwidth,
reducing noise in the sidebands, synchronizing clocks for data sampling,
reducing power usage, allowing bandwidth to be adapted by inserting new
parallel streams in the same shared band, allowing the framing format to
change when the signal-to-noise ratio degrades (by using additional signals
not normally used by the data stream), or adapting to the degradation of
the transport medium or to emergency situations (or sometimes to local
legal requirements) that require reducing usage to leave space for
priority traffic (e.g. air regulation or military use).

Each time the transport medium has to be shared with third parties (this is
the case for infrastructure networks or for radio frequencies in the
public airspace, which may also be shared internationally), or if the medium
is known to have a slowly degrading quality (e.g. SSD storage), the
transport and storage protocols never use the whole available bandwidth and
reserve some regulatory space for specific signalling that may be needed
to let the current usages adapt automatically: the physical format of
data streams can change at any time, and what was initially encoded one way
will then be encoded another way. Such things also occur extremely locally,
for example on data buses within computers, between the various
electronic chips on the same motherboard or on optional extensions
plugged into it. Electronic devices are full of bus adapters that have
to arbitrate between concurrent, unpredictable traffic flows under
changing environmental conditions, such as the current state of the power
sources.

Programmers, however, only see the result in the upper-layer data frames,
where they manage bits; from these they can create streams of bytes that are
usable for transport protocols and interchange over a larger network or
computing system.

But for the worldwide network (the Internet), everything is based on 8-bit
bytes, which are the minimal units of information (and also the maximal
units: larger units are not portable and not interoperable over the
global network) in all related protocols (including for negotiating options
in these protocols). UTF-8 is then THE universal encoding that will
interoperate everywhere on the Internet, even if locally (in connected
hosts) other encodings may be used (which ''may'' be more efficiently
processed) after a simple conversion (this does not necessarily require
changing the size of the code units used in local protocols and interfaces;
for example, there could be some re-encoding, or data compression or
expansion).
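
To make the last point concrete, here is a minimal sketch in Python of text
travelling as UTF-8 octets and being converted to a different local encoding
on a host. The sample string and the choice of UTF-16 as the "local"
encoding are only assumptions for illustration; any other encoding able to
represent the text would do.

```python
# Sample text (an assumption for illustration): it includes one character
# outside the BMP, so UTF-16 needs a surrogate pair for it.
text = "na\u00efve \U0001F600"

# Interchange form: UTF-8, a sequence of 8-bit code units (octets).
wire = text.encode("utf-8")

# Hypothetical local form on a connected host: UTF-16 (16-bit code units).
local = wire.decode("utf-8").encode("utf-16-le")

# Converting back for interchange gives exactly the same octets.
back = local.decode("utf-16-le").encode("utf-8")

utf8_units = len(wire)         # number of 8-bit code units
utf16_units = len(local) // 2  # number of 16-bit code units

print(utf8_units, utf16_units, back == wire)
```

The conversion is lossless in both directions for well-formed text; only the
size and number of code units change, not the characters transported.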