EBCDIC control characters

Ken Whistler kenwhistler at sonic.net
Thu Jun 18 13:00:12 CDT 2020


On 6/18/2020 8:54 AM, Corentin via Unicode wrote:
> Dear Unicode people.
>
> The C0 and C1 control blocks seem to have no intrinsic semantics, but 
> the control characters
> of multiple character sets (such as some of the ISO encodings, and 
> the EBCDIC control characters) map to the same block of code points 
> (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX

UTR, actually, not a UAX:

https://www.unicode.org/reports/tr16/tr16-8.html

> - not sure if this mapping is described anywhere else)

Yes, in excruciating detail in the IBM Character Data Representation 
Architecture:

https://www.ibm.com/downloads/cas/G01BQVRV

> such that a distinction between the different provenances is not 
> possible, despite these control characters having potentially 
> different semantics in their original character sets.

It isn't really a "character set" issue. Either ASCII graphic character 
sets or EBCDIC graphic character sets could be used, in principle, with 
different sets of control functions, mapped onto the control code 
positions in each overall scheme. That is typically how character sets 
worked in terminal environments.

What the IBM CDRA establishes is a well-defined mapping between all the 
code points used, so that reliable interchange between EBCDIC systems 
and ASCII-based systems is possible.

There is one gotcha to watch out for: there are two possible ways to 
map newlines back and forth.
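As a minimal sketch of that gotcha: the two conventions differ in whether EBCDIC NL (0x15) maps to U+0085 NEL (with EBCDIC LF 0x25 going to U+000A), or the reverse. The code below is illustrative only; the enum and function names are invented, and the exact byte assignments should be verified against the CDRA tables for your code page (037 and 1047 are commonly cited examples of the two conventions).

```cpp
#include <cstdint>

// Hypothetical sketch of the two CDRA newline conventions.
enum class NewlineConvention { NlToNel, NlToLf };

// Map an EBCDIC control byte to a Unicode code point under the chosen
// convention; returns 0xFFFFFFFF for bytes this sketch doesn't cover.
std::uint32_t map_ebcdic_control(std::uint8_t b, NewlineConvention c) {
    switch (b) {
        case 0x0D: return 0x000D;  // CR: same in both conventions
        case 0x05: return 0x0009;  // HT (tab): same in both conventions
        case 0x15: // EBCDIC NL: NEL in one convention, LF in the other
            return c == NewlineConvention::NlToNel ? 0x0085u : 0x000Au;
        case 0x25: // EBCDIC LF: takes whichever of LF/NEL that NL didn't
            return c == NewlineConvention::NlToNel ? 0x000Au : 0x0085u;
        default:   return 0xFFFFFFFFu;
    }
}
```

Data converted under one convention and converted back under the other will have its newlines silently swapped, which is why interchange partners have to agree on the mapping up front.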

>
> Has this ever been an issue? Was it discussed at any point in history?
> Is there a recommended way of dealing with that?
>
> I realize the scenario in which this might be relevant is a bit 
> far-fetched but as I try to push the C++ committee into the modern 
> age, these questions, unfortunately, arose.

There really is no way for a C or C++ compiler to interpret arbitrary 
control functions associated with control codes, in any case, other than 
the specific control functions baked into the languages (which are 
basically the same ones that the Unicode Standard insists should be 
nailed down to particular code points: CR, LF, TAB, etc.). Other control 
code points should be allowed (and not be messed with) in string 
literals, and the compiler should otherwise barf if they occur in 
program text where the language syntax doesn't allow it. And then 
compilers supporting EBCDIC should just use the IBM standard for mapping 
back and forth to ASCII-based values.
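That policy can be sketched as a small predicate. This is not how any particular compiler actually implements it; the function names are hypothetical, and it assumes source text has already been mapped to Unicode code points:

```cpp
#include <cstdint>

// The handful of controls with language-level meaning (TAB, LF, VT,
// FF, CR) -- roughly the set the Unicode Standard expects to be fixed.
bool is_syntactic_control(std::uint32_t cp) {
    switch (cp) {
        case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
            return true;
        default:
            return false;
    }
}

// In a string literal, pass any control code point through untouched;
// elsewhere in program text, reject controls that aren't syntactic.
bool acceptable_in_source(std::uint32_t cp, bool in_string_literal) {
    bool is_control = cp <= 0x001Fu || (cp >= 0x007Fu && cp <= 0x009Fu);
    if (!is_control) return true;
    return in_string_literal || is_syntactic_control(cp);
}
```

Under this sketch an EBCDIC-hosted compiler would apply the IBM mapping first and then run exactly the same check, so the policy itself stays encoding-neutral.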

--Ken



