EBCDIC control characters
corentin.jabot at gmail.com
Thu Jun 18 14:22:18 CDT 2020
On Thu, 18 Jun 2020 at 20:00, Ken Whistler <kenwhistler at sonic.net> wrote:
> On 6/18/2020 8:54 AM, Corentin via Unicode wrote:
> > Dear Unicode people.
> > The C0 and C1 control blocks seem to have no intrinsic semantics, but
> > the control characters
> > of multiple character sets (such as some of the ISO encodings, and
> > the EBCDIC control characters) map to the same block of code points
> > (for EBCDIC, a mapping is described in the UTF-EBCDIC UAX
> UTR, actually, not a UAX:
> > - not sure if this mapping is described anywhere else)
> Yes, in excruciating detail in the IBM Character Data Representation
> Architecture (CDRA).
Thanks, I will have to read that!
> > such that a distinction between the different provenances is not
> > possible, despite these control characters having potentially
> > different semantics in their original character sets.
> It isn't really a "character set" issue. Either ASCII graphic character
> sets or EBCDIC graphic character sets could be used, in principle, with
> different sets of control functions, mapped onto the control code
> positions in each overall scheme. That is typically how character sets
> worked in terminal environments.
That makes sense!
> What the IBM CDRA establishes is a reliable mapping between all the code
> points used, so that it was possible to set up reliable interchange
> between EBCDIC systems and ASCII-based systems.
> There is one gotcha to watch out for, because there are two possible
> ways to map newlines back and forth.
> > Has this ever been an issue? Was it discussed at any point in history?
> > Is there a recommended way of dealing with that?
> > I realize the scenario in which this might be relevant is a bit
> > far-fetched but as I try to push the C++ committee into the modern age,
> > these questions, unfortunately, arose.
> There really is no way for a C or C++ compiler to interpret arbitrary
> control functions associated with control codes, in any case, other than
> the specific control functions baked into the languages (which are
> basically the same that the Unicode Standard insists should be nailed
> down to particular code points: CR, LF, TAB, etc.). Other control code
> points should be allowed (and not be messed with) in string literals,
> and the compiler should otherwise barf if they occur in program text
> where the language syntax doesn't allow it. And then compilers
> supporting EBCDIC should just use the IBM standard for mapping back and
> forth to ASCII-based values.
The specific case that people are talking about is indeed string literals
such as "\x06\u0086", where the hexadecimal escape is meant to be an EBCDIC
character and the \uxxxx is meant to be the Unicode character that the
hexadecimal sequence would map to. The question is whether, in that very
odd scenario, they are the same character or not, and whether they should be.
Our current model is source encoding -> Unicode -> literal encoding, all
three encodings being potentially distinct,
so we do in fact "mess with" string literals, and the question is whether
going through Unicode should ever be considered destructive.
My argument is that it never is, because the conversion preserves semantics
in all the relevant use cases.
The question was in particular whether we should use "a superset of
Unicode" instead of "Unicode" in that intermediate step.
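
To make the "\x06\u0086" scenario concrete, here is a minimal sketch. It is
written in Python rather than C++ only because Python ships an EBCDIC codec,
and it rests on assumptions: the built-in "cp037" codec stands in for the
literal encoding (it is one EBCDIC code page among many, and other code pages
may map some controls differently), and the helper names are hypothetical.
It is an illustration of the round trip, not a description of what any
compiler does.

```python
# Sketch only: Python's built-in "cp037" codec is an assumed stand-in for an
# EBCDIC literal encoding. The function names below are hypothetical.

EBCDIC = "cp037"  # assumption: one particular EBCDIC code page


def ebcdic_to_unicode(data: bytes) -> str:
    """Decode EBCDIC bytes into Unicode code points (the intermediate step)."""
    return data.decode(EBCDIC)


def unicode_to_ebcdic(text: str) -> bytes:
    """Encode Unicode text back into the EBCDIC literal encoding."""
    return text.encode(EBCDIC)


def round_trips(data: bytes) -> bool:
    """True if EBCDIC -> Unicode -> EBCDIC returns the original bytes."""
    return unicode_to_ebcdic(ebcdic_to_unicode(data)) == data


if __name__ == "__main__":
    # Under this mapping, EBCDIC byte 0x06 and U+0086 are the same character.
    print(ebcdic_to_unicode(b"\x06") == "\u0086")
    print(unicode_to_ebcdic("\u0086") == b"\x06")
    # Check every byte value, control codes included, through Unicode and back.
    print(round_trips(bytes(range(256))))
```

If all three checks print True, then for this code page the trip through
Unicode loses nothing, which is the sense in which I call the conversion
non-destructive.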
Again thanks a lot for your reply!