Surrogates and noncharacters

Tue May 12 08:45:52 CDT 2015

2015-05-11 23:53 GMT+02:00 Hans Aberg <haberg-1 at telia.com>:

> It is perfectly fine considering the Unicode code points as abstract
> integers, with UTF-32 and UTF-8 encodings that translate them into byte
> sequences in a computer. The code points that conflict with UTF-16 might
> have been merely declared not in use until UTF-16 has fallen out of
> use, replaced by UTF-8 and UTF-32.

The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in
MIME) is already very advanced. But they will certainly not disappear as
encoding *forms* for internal use in binary APIs and in several very
popular programming languages: Java, JavaScript, C#, J#, and even C++ on
Windows platforms (where it is the 8-bit interface, based on legacy "code
pages" and with poor support of the UTF-8 encoding scheme as a Windows
"code page", that is now being phased out).

UTF-8 will also remain for a long time the preferred internal encoding for
Python and PHP (even if Python also introduced a 16-bit native datatype).

In all cases, programming languages are not based on any Unicode encoding
form but on more or less opaque streams of code units, using datatypes that
are not constrained by Unicode. Their "character" or "byte" datatype is
also used for binary I/O and for converting various binary structures,
including executable code, and this datatype is not even necessarily 8-bit:
it may be larger, and its size need not be a multiple of 8 bits.
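
As an illustration of that point, here is a minimal Python sketch. It
assumes CPython 3 and its built-in strict "utf-8" codec; the helper names
is_well_formed_text and is_valid_utf8 are hypothetical, introduced only for
this example. It shows that the native string and byte datatypes accept
values that no UTF allows, so validity has to be checked explicitly
somewhere.

```python
def is_well_formed_text(s: str) -> bool:
    """Return True if every element of s is encodable as strict UTF-8.

    A Python str may contain lone surrogates (U+D800..U+DFFF), which are
    code points but not Unicode scalar values; the strict codec rejects them.
    """
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def is_valid_utf8(b: bytes) -> bool:
    """Return True if b decodes as strict UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    lone_surrogate = "\ud800"            # accepted by the str datatype
    print(len(lone_surrogate))           # the string holds one element
    print(is_well_formed_text("abc"))
    print(is_well_formed_text(lone_surrogate))

    # bytes is an opaque stream too: arbitrary binary data is allowed.
    print(is_valid_utf8("é".encode("utf-8")))
    print(is_valid_utf8(b"\xff\xfe"))
    # A surrogate encoded with the UTF-8 bit pattern is rejected as well.
    print(is_valid_utf8(b"\xed\xa0\x80"))
```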

> One is going to check that the code points are valid Unicode values
> somewhere, so it is hard to see the point of restricting UTF-8 to align
> it with UTF-16.

What I meant when I started discussing this in the thread was just to
obsolete the unnecessary definitions of "x-bit strings" from TUS. The
standard does not need these definitions. If we want it to be really open
to various architectures, languages, and protocols, all that is needed is
the definition of "code units" specific to each standard UTF (encoding
form, or encoding scheme when splitting code units into smaller code units
and ordering them), determining only this order and the minimum set of
distinct values that these code units must support. We should not speak
about "bits", just about "sets" of distinct elements with a sufficient
cardinality.

So let's just speak about "UTF-8 code units", "UTF-16 code units", "UTF-32
code units" (not just "code units" and not even "Unicode code units", which
is also nonsense given the existence of standardized compression schemes
that also define their own "XXX code units").

If the expression "16-bit code units" has been used, it's purely for
internal use as a shortcut for the complete name, and these shortcuts are
not part of the external entities to standardize (they are not precise
enough and cannot be used safely out of their local context): consider
these definitions just as "private" ones (same meaning as in OOP), boxed as
internals of TUS seen as a black box.

It's not the focus of TUS to discuss what "strings" are: that is just a
matter for each integration platform that wants to use TUS.

In summary, the definitions in TUS should be split in two parts: those that
are "public" and needed by external references (in other standards), and
those that are private (many of them do not even have to be within the
generic section of the standard; they should be listed in the appropriate
sections needing them locally), clearly separating the "public" and
"private" interfaces.

In all cases, the public interfaces must define precise and unambiguous
terms, bound to the standard or section of the standard defining them, even
if later within that section a shortcut is used as a convenience (to make
the text easier to read). We need "scopes" for these definitions (and
shorter aliases must be made private).