Why Nothing Ever Goes Away

Sean Leonard lists+unicode at seantek.com
Fri Oct 9 13:05:54 CDT 2015


Satisfactory answers, thank you very much.

Going back to doing more research. (Silence does not imply abandoning 
the C1 Control Pictures project; just a lot to synthesize.)

Regarding the three code points U+0080, U+0081, and U+0099: the fact that 
Unicode defers mostly to ISO 6429 and other standards before its time 
(e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not 
particularly urgent that those code points get Unicode names. I also do 
not find that their lack of definition precludes pictorial 
representations. In the current U+2400 block, the Standard says: "The 
diagonal lettering glyphs are only exemplary; alternate representations 
may be, and often are used in the visible display of control codes" 
(see also Section 22.7).

I am now in possession of a copy of ANSI X3.32-1973 and ECMA-17:1968 
(the latter is available on ECMA's website). I find it worthwhile to 
point out that the Transmission Controls and Format Effectors were not 
standardized by the time of ECMA-17:1968, but the symbols are the same 
nonetheless. ANSI X3.32-1973 has the standardized control names for 
those characters.

Sean

On 10/6/2015 6:57 AM, Philippe Verdy wrote:
>
> 2015-10-06 14:24 GMT+02:00 Sean Leonard <lists+unicode at seantek.com 
> <mailto:lists+unicode at seantek.com>>:
>
>         2. The Unicode code charts are (deliberately) vague about
>         U+0080, U+0081,
>         and U+0099. All other C1 control codes have aliases to the ISO
>         6429
>         set of control functions, but in ISO 6429, those three control
>         codes don't
>         have any assigned functions (or names).
>
>
>     On 10/5/2015 3:57 PM, Philippe Verdy wrote:
>
>         Also the aliases for C1 controls were formally registered in
>         1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F
>         for ISO 6429.
>
>
>     If I may, I would appreciate another history lesson:
>     In ISO 2022 / 6429 land, it is apparent that the C1 controls are
>     mainly aliases for ESC 4/0 - 5/15. ( @ through _ ) This might vary
>     depending on what is loaded into the C1 register, but overall, it
>     just seems like saving one byte.
>
>     Why was C1 invented in the first place?
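(Aside: to make that one-byte saving concrete, here is a rough sketch, in 
Python and untested, of the ISO 6429 aliasing between an 8-bit C1 control 
and its two-byte 7-bit form, ESC followed by a byte in column 4 or 5. The 
function names are mine, not anything taken from a standard.)

    ESC = 0x1B

    def c1_to_7bit(c1):
        """Map an 8-bit C1 control (0x80..0x9F) to its ESC Fe form."""
        assert 0x80 <= c1 <= 0x9F
        return bytes([ESC, c1 - 0x40])   # final byte in 0x40..0x5F, "@".."_"

    def seven_bit_to_c1(final):
        """Map the final byte of an ESC Fe sequence ("@".."_") back to C1."""
        assert 0x40 <= final <= 0x5F
        return final + 0x40

    # Example: CSI (0x9B) is the one-byte form of ESC "[" (1B 5B),
    # and NEL (0x85) is the one-byte form of ESC "E" (1B 45).
    assert c1_to_7bit(0x9B) == b'\x1b['
    assert c1_to_7bit(0x85) == b'\x1bE'
    assert seven_bit_to_c1(ord('[')) == 0x9B
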
>
>
> Look for the history of EBCDIC and its adaptation/conversion with 
> ASCII-compatible encodings: round-trip conversion was needed (using 
> only a simple reordering of byte values, with no duplicates). EBCDIC 
> used many controls that were not part of C0 and were kept in the 
> C1 set. Ignore the 7-bit compatibility encoding using pairs; they were 
> only needed for ISO 2022, but ISO 6429 defines a profile where those 
> longer sequences are not needed and are even forbidden in 8-bit contexts 
> or in contexts where aliases are undesirable and invalidated, such as 
> security environments.
>
> By your reasoning, I would conclude that assigning characters in the 
> G1 set was also a duplication, because they are reachable with a C0 
> "shifting" control plus a position in the G0 set. In that case, ISO 
> 8859-1 or Windows-1252 was also an unneeded duplication! And we would 
> live today in a 7-bit-only world.
>
> C1 controls have their own identity. The 7-bit encoding using ESC is 
> just a hack to make them fit in 7-bit and it only works where the ESC 
> control is assumed to play this function according to ISO 2022, ISO 
> 6429, or other similar old 7-bit protocols such as Videotext (which 
> was widely used in France with the free "Minitel" terminal, long 
> before the introduction of the Internet to the general public around 
> 1992-1995).
>
> Today Videotext is definitely dead: the old call numbers for this slow 
> service are defunct, the Minitels have been recycled as waste, and they 
> stopped being distributed, replaced by applications on PCs connected to 
> the Internet. All the old services are now directly on the Internet, 
> and none of them use 7-bit encodings for their HTML pages or their 
> mobile applications. France has also definitively abandoned its old 
> French version of ISO 646; there are no longer any printers supporting 
> versions of ISO 646 other than ASCII, though they still support various 
> 8-bit encodings.
>
> 7-bit encodings are things of the past. They were only justified at a 
> time when communication links were slow and generated lots of 
> transmission errors, and the only implemented mechanism to check them 
> was a single parity bit per character. Today we transmit long datagrams 
> and prefer check codes over the whole message (such as CRCs, or 
> error-correcting codes). 8-bit encodings are much easier and faster to 
> process for transmitting not just text but also binary data.
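(Aside: a rough, untested Python sketch of the two check mechanisms being 
contrasted here -- a single even-parity bit per 7-bit character versus one 
CRC over the whole message. The helper names are mine.)

    import zlib

    def add_even_parity(ch7):
        """Set bit 7 so the byte ends up with an even number of 1 bits."""
        assert 0 <= ch7 <= 0x7F
        return ch7 | ((bin(ch7).count('1') & 1) << 7)

    def parity_ok(byte):
        return bin(byte).count('1') % 2 == 0

    assert add_even_parity(0x41) == 0x41   # 'A' already has even parity
    assert add_even_parity(0x43) == 0xC3   # 'C' needs the high bit set
    assert parity_ok(add_even_parity(0x43))

    # Today: one check value over the whole datagram instead.
    crc = zlib.crc32(b'a whole message, checked as one unit')
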
>
> Let's forget the 7-bit world for good. We have also abandoned the 
> old UTF-7 in Unicode! I've not seen it used anywhere except in a few 
> old emails sent at the end of the '90s, because many mail servers were 
> still not 8-bit clean and silently transformed non-ASCII bytes in 
> unpredictable ways or with unspecified encodings, or just silently 
> dropped the high bit, assuming it was just a parity bit. At that 
> time, emails were not sent with SMTP but with the old UUCP protocol, 
> and could take weeks to be delivered to the final recipient, as there 
> was still no global routing infrastructure and many hops were 
> necessary via non-permanent modem links. My opinion of UTF-7 is that 
> it was just a temporary and experimental solution to help system 
> admins and developers adopt the new UCS, including in their old 7-bit 
> environments.


On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote:
> On 10/6/2015 5:24 AM, Sean Leonard wrote:
>> And, why did Unicode deem it necessary to replicate the C1 block at 
>> 0x80-0x9F, when all of the control characters (codes) were equally 
>> reachable via ESC 4/0 - 5/15? I understand why it is desirable to 
>> align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with 
>> Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the 
>> other non-ISO-standardized 8-bit encodings got this much right: 
>> duplicating control codes is basically a waste of very precious 
>> character code real estate
>
> Because Unicode aligns with ISO 8859-1, so that transcoding from that 
> was a simple zero-fill to 16 bits.
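(Aside: the "zero-fill" really is the whole transcoder. A rough Python 
sketch, untested, of mapping ISO 8859-1 bytes to Unicode code points; the 
function name is mine.)

    def latin1_to_code_points(data):
        """Each ISO 8859-1 byte value IS the Unicode code point."""
        return [b for b in data]           # zero-extend: no table needed

    assert latin1_to_code_points(b'\x41\xE9\x85') == [0x41, 0xE9, 0x85]
    # ...which matches what a real decoder produces:
    assert b'\x41\xE9\x85'.decode('latin-1') == '\x41\xE9\x85'
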
>
> 8859-1 was the most widely used single byte (full 8-bit) ISO standard 
> at the time, and making that transition easy was beneficial, both 
> practically and politically.
>
> Vendor standards all disagreed on the upper range, and it would not 
> have been feasible to single out any of them. Nobody wanted to follow 
> the IBM code page 437 (then still the most widely used single byte 
> vendor standard).
>
>
> Note that by "then" I refer to dates earlier than the dates of the 
> final drafts, because many of those decisions date back to earlier 
> periods when the drafts were first developed. Also, the overloading of 
> 0x80-0x9F by Windows did not happen all at once; earlier versions had 
> left much of that space open, but then people realized that as long as 
> you were still limited to 8 bits, throwing away 32 codes was an issue.
>
> Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now) 
> don't matter, so being "clean" didn't cost much. (Note that even for 
> UTF-8, there's no special benefit to a value being inside that second 
> range of 128 codes.)
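(Aside: a rough, untested Python check of that point -- in UTF-8 the C1 
code points cost two bytes, exactly like every other code point up to 
U+07FF, so keeping 0x80..0x9F for controls carries no extra penalty there.)

    nel = chr(0x85).encode('utf-8')       # a C1 control, NEL
    e_acute = chr(0xE9).encode('utf-8')   # an ordinary Latin-1 letter

    assert nel == b'\xc2\x85'
    assert e_acute == b'\xc3\xa9'
    assert len(nel) == len(e_acute) == 2  # same cost either way
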
>
> Finally, even if the range had not been dedicated to C1, the 32 codes 
> would have had to be given space, because the translation into ESC 
> sequences is not universal, so, in transcoding data you needed to have 
> a way to retain the difference between the raw code and the ESC 
> sequence, or your round-trip would not be lossless.
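(Aside: a rough, untested Python sketch of why that matters. In 8-bit data 
the one-byte C1 control and its two-byte ESC spelling are distinct byte 
sequences, so a lossless transcoder has to keep them distinct in Unicode 
as well; the variable names are mine.)

    raw_nel = b'\x85'                    # NEL as a single C1 byte
    esc_nel = b'\x1bE'                   # NEL spelled as ESC "E"

    u_raw = raw_nel.decode('latin-1')    # -> '\u0085'
    u_esc = esc_nel.decode('latin-1')    # -> '\u001b' + 'E'

    assert u_raw != u_esc                       # the difference survives...
    assert u_raw.encode('latin-1') == raw_nel   # ...and the round trip
    assert u_esc.encode('latin-1') == esc_nel   # is lossless both ways
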
>
> A./


