EBCDIC control characters

Corentin corentin.jabot at gmail.com
Sat Jun 20 03:50:28 CDT 2020


On Fri, 19 Jun 2020 at 23:00, Markus Scherer <markus.icu at gmail.com> wrote:

> I would soften a bit what Ken and Asmus have said.
>
> Of course C++ compilers have to deal with a variety of charsets/codepages.
> There is (or used to be) a lot of code in various Windows/Mac/Linux/...
> codepages, including variations of Shift-JIS, EUC-KR, etc.
>
> My mental model of how compilers work (which might be outdated) is that
> they work within a charset family (usually ASCII, but EBCDIC on certain
> platforms) and mostly parse ASCII characters as is (and for the "basic
> character set" in EBCDIC, mostly assume the byte values of cp37 or 1047
> depending on platform). For regular string literals, I expect it's mostly a
> pass-through from the source code (and \xhh bytes) to the output binary.
>
> But of course C++ has syntax for Unicode string literals. I think
> compilers basically call a system function to convert from the source bytes
> to Unicode, either with the process default charset or with an explicit one
> if specified on the command line.
>
> And then there are \uhhhh and \U00HHHHHH escapes even in non-Unicode
> string literals, as Corentin said. What I would expect to happen is that
> the compiler copies all of the literal bytes, and when it reads a Unicode
> escape it converts that one code point to the byte sequence in the default
> or execution-charset.
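
Right. To make that concrete for people following along, a small
illustration (the encodings are only examples, not tied to any particular
compiler):

    // In a narrow ("non-Unicode") literal, \u names a code point; the
    // compiler converts that code point to the literal/execution encoding
    // at compile time.
    const char s[] = "caf\u00E9";
    // UTF-8 literal encoding:   63 61 66 C3 A9 00
    // Latin-1 literal encoding: 63 61 66 E9 00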
>
> It would get more interesting if a compiler had options for different
> source and execution charsets. I don't know if they would convert regular
> string literals directly from one to the other, or if they convert
> everything to Unicode (like a Java compiler) and then to the execution
> charset. (In Java, the execution charset is UTF-16, so the problem space
> there is simpler.)
>

Yes, and people are indeed talking about that for the sake of legacy
projects, and in that scenario using Unicode internally makes even more sense.


>
> Of course, in many cases a conversion from A to B will pivot through
> Unicode anyway (so that you only need 2n tables not n^2.)
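
For readers who have not implemented this, a minimal sketch of such a
pivot for single-byte charsets, with made-up table types rather than any
real library:

    #include <map>
    #include <optional>

    struct Charset {
        char32_t to_unicode[256];                        // byte value -> code point
        std::map<char32_t, unsigned char> from_unicode;  // code point -> byte value
    };

    // One to-Unicode and one from-Unicode table per charset: any A->B
    // conversion is two lookups, so n charsets need 2n tables, not n^2.
    std::optional<unsigned char> convert(unsigned char b,
                                         const Charset& src, const Charset& dst) {
        char32_t cp = src.to_unicode[b];       // pivot: source byte -> code point
        auto it = dst.from_unicode.find(cp);   // code point -> destination byte
        if (it == dst.from_unicode.end())
            return std::nullopt;               // not mappable in the destination charset
        return it->second;
    }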
>
> About character conversion in general I would caution that there are
> basically two types of mappings: Round-trip mappings for what's really the
> same character on both sides, and fallbacks where you map to a different
> but more or less similar/related character because that may be more
> readable than a question mark or a replacement character. In a compiler, I
> would hope that both unmappable characters and fallback mappings lead to
> compiler errors, to avoid hidden surprises in runtime behavior.
>

I am hoping to make conversions that do not preserve semantics ill-formed.
Right now compilers behave differently: some will not compile, some will
insert question marks, leading to the runtime issues you describe.

Now, my argument is that going through Unicode (and keep in mind that we
are describing a specification, not compiler implementations) lets us
simplify the spec without preventing (nor mandating) round-tripping if the
source and literal encodings happen to be the same. If there is a way
through Unicode, then transitively there is a direct way.

> This probably constrains what the compiler can and should do. As a
> programmer, I want to be able to put any old byte sequence into a string
> literal, including NUL, controls, and non-character-encoding bytes. (We use
> string literals for more things than "text".) For example, when we didn't
> yet have syntax for UTF-8 string literals, we could write unmarked literals
> with \xhh sequences and pass them into functions that explicitly operated
> on UTF-8, regardless of whether those byte sequences were well-formed
> according to the source or execution charsets. This pretty much works only
> if there is no conversion that puts limits on the contents.
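
A concrete instance of that pattern, for anyone who has not run into it
(the bytes are simply the UTF-8 encoding of U+2603 SNOWMAN):

    // Before u8"" literals, UTF-8 payloads were often spelled with \x
    // escapes; the bytes are copied into the program untouched, whatever
    // the source or execution charset, even if they would be ill-formed
    // text in those charsets.
    const char snowman_utf8[] = "\xE2\x98\x83";   // U+2603 as raw UTF-8 bytes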
>

Okay, we are really in C++ territory now. For the sake of people who are
not aware: the \0 and \x escape sequences are really integer values; they
are never semantically characters and never involve a conversion.
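
A two-line illustration of the difference:

    const char a[] = "\xE9";     // exactly one byte with value 0xE9; never converted
    const char b[] = "\u00E9";   // the character U+00E9; converted to the literal
                                 // encoding (C3 A9 under UTF-8, E9 under Latin-1)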


>
> I believe that EBCDIC platforms have dealt with this, where necessary, by
> using single-byte conversion mappings between EBCDIC-based and ASCII-based
> codepages that were strict permutations. Thus, control codes and other byte
> values would round-trip through any number of conversions back and forth.
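
That permutation property is easy to check in code; a minimal sketch,
with an identity placeholder instead of a real cp37/Latin-1 table:

    #include <array>
    #include <cassert>
    #include <numeric>

    int main() {
        std::array<unsigned char, 256> to_ascii{};
        std::iota(to_ascii.begin(), to_ascii.end(), 0);  // placeholder permutation (identity)

        std::array<unsigned char, 256> to_ebcdic{};      // build the inverse mapping
        for (int b = 0; b < 256; ++b)
            to_ebcdic[to_ascii[b]] = static_cast<unsigned char>(b);

        // If the forward table is a bijection over 0..255, every byte
        // value -- controls and unassigned bytes included -- round-trips.
        for (int b = 0; b < 256; ++b)
            assert(to_ebcdic[to_ascii[b]] == b);
    }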
>
> PS: I know that this really goes beyond string literals: C++ identifiers
> can include non-ASCII characters. I expect these to work much like regular
> string literals, minus escape sequences. I guess that the execution charset
> still plays a role for the linker symbol table.
>
> Best regards,
> markus
>