Why Nothing Ever Goes Away (was: Re: Acquiring DIS 10646)

Mon Oct 5 14:32:45 CDT 2015

On 10/5/2015 8:24 AM, Doug Ewell wrote:
> I too am puzzled as to what DIS 10646 and C1 control pictures have to do
> with each other.
>

What an *excellent* cue to start a riff on arcane Unicode history!

First, let me explain what I think Sean Leonard's concern here is.

1. On 10/4/2015 5:30 AM, Sean wrote: "I proposed adding C1 Control
Pictures to Unicode. ... The requirement is that all glyphs for
U+0000 - U+00FF be graphically distinct."

Ah, but Sean has noticed that of all the representative glyphs we use
in the current code charts for C1 control codes, exactly *3* of them share
an odd glyph. U+0080, U+0081, and U+0099 use the same dotted box
with an "XXX" in it. That creates a conflict with the requirement that
Sean has stated for glyphs for *graphic symbols for* control codes,
presumably for addition to the 2400 Control Pictures block and some
extensions elsewhere, each with a visually distinct representation.

2. The Unicode code charts are (deliberately) vague about U+0080, U+0081,
and U+0099. All other C1 control codes have aliases to the ISO 6429
set of control functions, but in ISO 6429, those three control codes don't
have any assigned functions (or names). And because the C1 aliases in
the Unicode code charts are (deliberately) based on ISO 6429, U+0080,
U+0081, and U+0099 are only identified as "<control>", with no alias in
the charts, and with an arbitrary "XXX" box glyph.

3. Concerned about this gap, Sean did some due diligence research on the
web, and turned up documentation pages such as:

http://utopia.knoware.nl/users/eprebel/Communication/CharacterSets/Controls.html

Pertinent to this discussion is the section for C1 on that page which
(incorrectly) includes "DIS 10646" in the list of "Standards". More to the
point, the entries for the 3 C1 code points in question are documented as:

08/00 ... PAD  PADding character (only in DIS 10646)
08/01 ... HOP High Octet Preset (only in DIS 10646)
...
09/09 ... SGCI Single Graphic Character Introducer (only in DIS 10646)

Aha! Hence the need to track down a copy of DIS 10646 (meaning in
actuality, the appropriately numbered WG2 N666, "DIS 10646", dated
November 4, 1990). That was actually what became DIS 1, the DIS
that failed, the DIS that led to the *second* DIS 10646, which was the
basis of the Unicode/10646 merger. But I digress... ;-)

4. O.k., so with that connection out of the way, I can proceed to the
topic of this thread: Why Nothing Ever Goes Away.

PAD, HOP, and SGCI were arcane, proposed architectural additions to
the early drafts of 10646, from the days when 10646 was still slavishly
following the ISO 2022 framework, and was avoiding C0 and C1 byte values
in all representations, including single-, double-, triple-, and
quadruple-byte forms for characters.

HOP was one of those half-baked terminal protocol byte compression
concoctions. The idea was that since some commonly used blocks
of characters would require double-byte representation, but would
all have the same "high octet", you could send a HOP, and then a bunch
of low octets down the line. In effect, it was intended as a script
switcher.

SGCI was complementary to that. It would let you introduce a sequence
of multiple octets for a single character, without having to switch out
of your high octet preset mode.

PAD I forget the exact details of. Something to do with padding out
character representations into fixed length.

All of these were firmly rejected in the merger discussions and the failed
DIS vote. Actually, they were down in the noise compared to major issues
like CJK plane swapping and such, but there clearly was no need for
10646 to invent new control functions like these, and the early drafts
of the Unicode Standard had nothing of the sort.

So these were gone in DIS 1.2 for 10646. They were *never* published
as part of ISO 10646-1:1993 (or any later edition). Nor were they
ever published in an ISO control function standard. Nor were they
ever published in the Unicode Standard, of course. They were never
standard *anything* -- just ill-advised concept functions that later got
dropped in the drafts.

But wait! If these disappeared from any standard draft way back in 1991(!),
why are we still talking about them? Why are they still documented on
web pages for C1 control characters in 2015, 24 years later? Funny you
should ask!

The problem is that they went viral. And that in an age before anybody
really knew what "going viral" even meant. ;-) The first problem is that
a bunch of mnemonics for characters were published in an RFC. And
those mnemonics included characters from early drafts of 10646.
The notorious document in question is RFC 1345:

Simonsen, K., "Character Mnemonics & Character Sets", June 1992.

Go ahead, it is still there:

https://tools.ietf.org/rfc/rfc1345.txt

And that has entries for the non-existent control codes, which by the
time RFC 1345 was published, had *already* been removed from the
10646 drafts. To wit:

PA     0080    PADDING CHARACTER (PAD)

HO     0081    HIGH OCTET PRESET (HOP)

GC     0099    SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI)

RFC 1345 was, in turn, referenced by other important IETF documents,
including RFC 2070, "Internationalization of the
Hypertext Markup Language", which defines the syntax for
character entity names.

Entity names for PAD, HOP, and SGCI then found their way into Java
and other implementations. They ended up referenced in tables
supporting regular expressions. And so on. Somehow they had
become the walking dead control functions.

This came back around to the Unicode Standard about the time the
U+1F514 BELL and U+0007 <control> alias BELL name collision
issue hit the fan. The UTC response to this problem was to augment
the formal name aliases to include all widely used control
function names and abbreviations, so that testing for name
collisions in that name space would prevent any future BELL/BELL
issues. See, in particular, the related PRI on this topic for
Unicode 6.1.0:

http://www.unicode.org/review/pri202/

which explicitly mentions U+0080, U+0081, and U+0099 and their
aliases, because of a need for backward compatibility to then-existing
usage in Perl 5.

The outcome of that PRI was to add a bunch of formal name aliases,
*including* ones for PAD, HOP, and SGCI (or SGC). To wit, from
NameAliases.txt:

=======================================================

# PADDING CHARACTER and HIGH OCTET PRESET represent
# architectural concepts initially proposed for early
# drafts of ISO/IEC 10646-1. They were never actually
# approved or standardized: hence their designation
# here as the "figment" type. Formal name aliases
# (and corresponding abbreviations) for these code
# points are included here because these names leaked
# out from the draft documents and were published in
# at least one RFC whose names for code points was
# implemented in Perl regex expressions.

0080;PADDING CHARACTER;figment
0080;PAD;abbreviation
0081;HIGH OCTET PRESET;figment
0081;HOP;abbreviation

# SINGLE GRAPHIC CHARACTER INTRODUCER is another
# architectural concept from early drafts of ISO/IEC 10646-1
# which was never approved and standardized.

0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment
0099;SGC;abbreviation

=============================================================
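
For anyone who wants to poke at these entries programmatically, here is
a minimal Python sketch that parses lines in the semicolon-separated
layout shown in the excerpt and picks out the "figment" aliases. The
SAMPLE text is only the excerpt quoted here, not the whole file, and the
name parse_name_aliases is made up for illustration.

```python
# Hypothetical helper: the field layout (code point; alias; type) is
# taken from the NameAliases.txt excerpt quoted in this message.
SAMPLE = """\
# comment lines start with '#'
0080;PADDING CHARACTER;figment
0080;PAD;abbreviation
0081;HIGH OCTET PRESET;figment
0081;HOP;abbreviation
0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment
0099;SGC;abbreviation
"""


def parse_name_aliases(text):
    """Yield (code_point, alias, alias_type) tuples from NameAliases-style text."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        code, alias, alias_type = line.split(";")
        yield int(code, 16), alias, alias_type


# Keep only the aliases whose type is "figment".
figments = {
    cp: alias
    for cp, alias, kind in parse_name_aliases(SAMPLE)
    if kind == "figment"
}

for cp, alias in sorted(figments.items()):
    print(f"U+{cp:04X} {alias}")
```

Run against the excerpt, this lists exactly the three code points under
discussion, one figment alias each.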

Because of stability guarantees, however, NameAliases.txt is a
write-once, read-only, unerasable file. For better or for worse, we
are now stuck forever with those name aliases for U+0080, U+0081,
and U+0099, *even though* the relevant control functions were
never, ever actually standardized or used anywhere.
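
As one illustration of how far these aliases have travelled: to the best
of my knowledge, Python's standard unicodedata module has resolved
formal name aliases in lookup() since Python 3.3. The sketch below
assumes that behaviour and a Unicode database of version 6.1 or later.
It is a quick check, not a statement about how any particular
implementation is built.

```python
import unicodedata

# The three code points are plain controls (General_Category Cc) and have
# no character name of their own, so name() falls back to the default.
for cp in (0x80, 0x81, 0x99):
    ch = chr(cp)
    print(f"U+{cp:04X}", unicodedata.category(ch),
          unicodedata.name(ch, "<no name>"))

# Assumption: lookup() resolves entries from NameAliases.txt, including
# the "figment" aliases (Python 3.3+ with Unicode 6.1+ data).
pad = unicodedata.lookup("PADDING CHARACTER")
hop = unicodedata.lookup("HIGH OCTET PRESET")
sgci = unicodedata.lookup("SINGLE GRAPHIC CHARACTER INTRODUCER")

print([f"U+{ord(c):04X}" for c in (pad, hop, sgci)])
```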

Think of them as just part of the arcane mysteries now: odd
labels for the three code points, which (nearly) nobody understands.

Another of the many Unicode just-so stories. :-)

--Ken