Encoding of old compatibility characters

Ken Whistler kenwhistler at att.net
Mon Mar 27 12:18:03 CDT 2017


On 3/27/2017 7:44 AM, Charlotte Buff wrote:
> Now, one of Unicode’s declared goals is to enable round-trip 
> compatibility with legacy encodings. We’ve accumulated a lot of weird 
> stuff over the years in the pursuit of this goal. So it would be 
> natural to assume that the unencoded characters from the mentioned 
> sets [ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI 
> calculator sets] would also be eligible for inclusion in the UCS.

Actually, it wouldn't be.

The original goal was to ensure round-trip compatibility with 
*important* legacy character encodings, *for which there was a need to 
convert legacy data, and/or an ongoing need to represent text for 
interchange*.
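
A minimal sketch of what "round-trip compatibility" means in practice, 
assuming Python's standard Shift-JIS codec as the stand-in legacy 
encoding (an illustration only; not specific to the sets under 
discussion):

    # Bytes in a supported legacy encoding decode to Unicode and
    # re-encode without loss. 0xB1..0xB3 are halfwidth katakana in
    # Shift-JIS, mapping to U+FF71..U+FF73.
    legacy = b"\xb1\xb2\xb3"
    text = legacy.decode("shift_jis")             # legacy -> Unicode
    assert text.encode("shift_jis") == legacy     # Unicode -> legacy, lossless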

From Unicode 1.0: "The Unicode standard includes the character content 
of all major International Standards approved and published before 
December 31, 1990... [long list ensues] ... and from various industry 
standards in common use (such as code pages and character sets from 
Adobe, Apple, IBM, Lotus, Microsoft, WordPerfect, Xerox and others)."

Even as long ago as 1990, artifacts such as the Atari ST set were 
considered obsolete antiquities, and did not rise to the level of the 
kind of character listings that we considered when pulling together the 
original repertoire.

And there are several observations to be made about the "weird stuff" we 
have accumulated over the years in the pursuit of compatibility. A lot 
of stuff that was made up out of whole cloth, rather than being 
justified by existing, implemented character sets used in information 
interchange at the time, came from the 1991/1992 merger process between 
the Unicode Standard and the ISO/IEC 10646 drafts. That's how Unicode 
acquired blocks full of Arabic ligatures, for example.

Other, subsequent additions of small (or even largish) sets of oddball 
"characters" that don't fit the prototypical sets of characters for 
scripts and/or well-behaved punctuation and symbols have typically come 
in with argued cases for a continued need in current text interchange, 
or for complete coverage of an existing set. For example, that is how 
we ended up filling out
Zapf dingbats with some glyph pieces that had been omitted in the 
initial repertoire for that block. More recently, of course, the 
continued importance of Wingdings and Webdings font encodings on the 
Windows platform led the UTC to fill out the set of graphical 
dingbats to cover those sets. And of course, we first started down the 
emoji track because of the need to interchange text originating from 
widely deployed Japanese carrier sets implemented as extensions to 
Shift-JIS.

I don't think the early calculator character sets, or the sets for the 
Atari ST and similar early consumer computing devices, fit the bill, 
precisely because there isn't a real text data interchange case to be 
made for encoding them. Many of the elements you have mentioned, such 
as the inverse/negative squared versions of letters and symbols, are 
simply idiosyncratic aspects of the UI for those devices, in an era 
when font generators were hard-coded and very primitive indeed.

Documenting these early uses, and pointing out parts of the UI and 
character usage that aren't part of the character repertoire in the 
Unicode Standard, seems an interesting pursuit to me. But absent a true 
textual data interchange issue for these long-gone, obsolete devices, I 
don't really see a case to be made for spending time in the UTC defining 
a bunch of compatibility characters to encode for them.

--Ken


