EBCDIC control characters
kenwhistler at sonic.net
Thu Jun 18 13:00:12 CDT 2020
On 6/18/2020 8:54 AM, Corentin via Unicode wrote:
> Dear Unicode people.
> The C0 and C1 control blocks seem to have no intrinsic semantics, but
> the control characters of multiple character sets (such as some of the
> ISO encodings, and the EBCDIC control characters) map to the same block
> of code points (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX
UTR, actually, not a UAX:
> - not sure if this mapping is described anywhere else)
Yes, in excruciating detail in the IBM Character Data Representation
Architecture (CDRA) documentation.
> such that a distinction between the different provenances is not
> possible, despite these control characters having potentially
> different semantics in their original character sets.
It isn't really a "character set" issue. Either ASCII graphic character
sets or EBCDIC graphic character sets could be used, in principle, with
different sets of control functions, mapped onto the control code
positions in each overall scheme. That is typically how character sets
worked in terminal environments.
What the IBM CDRA establishes is a reliable mapping between all the code
points used, so that reliable interchange between EBCDIC systems and
ASCII-based systems is possible.
There is one gotcha to watch out for, because there are two possible
ways to map newlines back and forth.
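That gotcha can be seen directly with Python's bundled cp037 codec (one
of the CDRA EBCDIC code pages). A minimal sketch, assuming Python's
codec follows the convention that maps EBCDIC NL (0x15) to Unicode NEL
(U+0085) rather than to LF; the other CDRA convention maps NL to LF
instead, which is exactly the ambiguity to watch for:

```python
# EBCDIC has two newline-ish controls: NL (0x15) and LF (0x25).
# A converter must pick one of two CDRA conventions for NL:
#   NL -> U+0085 (NEL)   or   NL -> U+000A (LF).
# Python's cp037 codec takes the first option.
ebcdic_nl = b'\x15'
ebcdic_lf = b'\x25'

print(hex(ord(ebcdic_nl.decode('cp037'))))  # NL decodes to U+0085 (NEL)
print(hex(ord(ebcdic_lf.decode('cp037'))))  # LF decodes to U+000A

# Round-tripping through a converter that made the *other* choice
# would silently change which byte your newlines come back as.
```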
> Has this ever been an issue? Was it discussed at any point in history?
> Is there a recommended way of dealing with that?
> I realize the scenario in which this might be relevant is a bit
> far-fetched but as I try to push the C++ committee in the modern age,
> these questions, unfortunately, arose.
There really is no way for a C or C++ compiler to interpret arbitrary
control functions associated with control codes, in any case, other than
the specific control functions baked into the languages (which are
basically the same that the Unicode Standard insists should be nailed
down to particular code points: CR, LF, TAB, etc.). Other control code
points should be allowed (and not be messed with) in string literals,
and the compiler should otherwise barf if they occur in program text
where the language syntax doesn't allow it. And then compilers
supporting EBCDIC should just use the IBM standard for mapping back and
forth to ASCII-based values.
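As a quick illustration of that last point, the handful of control
functions baked into C and C++ (HT, CR, LF) do have fixed code points
under the IBM mapping. A minimal sketch, again using Python's cp037
codec as a stand-in for the standard EBCDIC mapping:

```python
# The control characters C/C++ source syntax actually depends on,
# with their EBCDIC code points under the cp037 (CDRA) mapping.
fixed_controls = {
    '\t': 0x05,  # HT
    '\r': 0x0D,  # CR
    '\n': 0x25,  # LF (EBCDIC LF, distinct from NL at 0x15)
}

for ch, ebcdic_byte in fixed_controls.items():
    assert ch.encode('cp037') == bytes([ebcdic_byte])

# Any other control code point would simply pass through untouched
# in string literals, per the recommendation above.
```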