EBCDIC control characters

Fri Jun 19 16:00:21 CDT 2020

I would soften a bit what Ken and Asmus have said.

Of course C++ compilers have to deal with a variety of charsets/codepages.
There is (or used to be) a lot of code in various Windows/Mac/Linux/...
codepages, including variations of Shift-JIS, EUC-KR, etc.

My mental model of how compilers work (which might be outdated) is that
they work within a charset family (usually ASCII, but EBCDIC on certain
platforms) and mostly parse ASCII characters as is (and for the "basic
character set" in EBCDIC, mostly assume the byte values of cp37 or cp1047
depending on platform). For regular string literals, I expect it's mostly a
pass-through from the source code (and \xhh bytes) to the output binary.
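
To make the "charset family" point concrete, here is a small sketch that compares the byte values of a few basic characters in an ASCII-based and an EBCDIC-based encoding. Python is used only as a convenient stand-in for looking up code page tables; cp037 is the EBCDIC codec assumed here, and `byte_values` is a hypothetical helper name.

```python
# Byte values of a few "basic character set" characters in an ASCII-based
# and an EBCDIC-based encoding. cp037 is used because it is an EBCDIC codec
# that ships with Python; cp1047 is not assumed to be available.
basic_chars = ("A", "a", "0", "{", "\n")

def byte_values(chars, encoding):
    """Map each character to its single byte value in the given encoding."""
    return {ch: ch.encode(encoding)[0] for ch in chars}

ascii_values = byte_values(basic_chars, "ascii")
ebcdic_values = byte_values(basic_chars, "cp037")

for ch in basic_chars:
    print(f"{ch!r}: ASCII 0x{ascii_values[ch]:02X}, cp037 0x{ebcdic_values[ch]:02X}")
```

A parser that hard-codes 0x41 for "A" works within the ASCII family only; on an EBCDIC platform the same letter arrives as 0xC1.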

But of course C++ has syntax for Unicode string literals. I think compilers
basically call a system function to convert from the source bytes to
Unicode, either with the process default charset or with an explicit one if
specified on the command line.

And then there are \uhhhh and \U00hhhhhh escapes even in non-Unicode string
literals, as Corentin said. What I would expect to happen is that the
compiler copies all of the literal bytes, and when it reads a Unicode
escape it converts that one code point to the byte sequence in the
execution charset (the default one, or the one given on the command line).
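
The behavior described above can be sketched roughly as follows. This is only a model of the expectation, not how any particular compiler is implemented; `cook_literal` is a hypothetical name, the source is assumed to be in an ASCII-family charset, and other escapes (\xhh, \n, an escaped backslash before "u") are deliberately ignored.

```python
import re

# Matches \uhhhh or \Uhhhhhhhh in the raw literal bytes (ASCII-family source).
_UNICODE_ESCAPE = re.compile(rb"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def cook_literal(raw: bytes, execution_charset: str) -> bytes:
    """Copy literal bytes unchanged; convert only the Unicode escapes.

    Hypothetical helper: a real compiler handles many more escape forms.
    """
    def replace(match):
        digits = match.group(1) or match.group(2)
        code_point = int(digits, 16)
        # Strict encoding (the default) raises if the code point has no
        # mapping in the execution charset.
        return chr(code_point).encode(execution_charset)

    return _UNICODE_ESCAPE.sub(replace, raw)

# The 0xFF byte is not valid UTF-8, but it is copied through untouched;
# only the escape is converted to the chosen execution charset.
as_latin1 = cook_literal(b"caf\\u00e9 \xff", "latin-1")
as_utf8 = cook_literal(b"caf\\u00e9 \xff", "utf-8")
```

With these inputs the escape becomes one byte (0xE9) for latin-1 and two bytes (0xC3 0xA9) for UTF-8, while the surrounding bytes are identical in both results.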

It would get more interesting if a compiler had options for different
source and execution charsets. I don't know if they would convert regular
string literals directly from one to the other, or if they convert
everything to Unicode (like a Java compiler) and then to the execution
charset. (In Java, the execution charset is UTF-16, so the problem space
there is simpler.)

Of course, in many cases a conversion from A to B will pivot through
Unicode anyway, so that you need only 2n tables rather than n^2.
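
A minimal sketch of the pivot idea, using Python's built-in codecs as the per-charset tables. The `convert` function and the list of charsets are illustrative assumptions, and the count for direct tables uses n(n-1) ordered pairs, which is the same order of magnitude as n^2.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert between two charsets by pivoting through Unicode."""
    text = data.decode(source)   # table 1: source bytes -> Unicode
    return text.encode(target)   # table 2: Unicode -> target bytes

# Illustrative set of charsets; any list would do.
charsets = ["ascii", "latin-1", "cp037", "cp500", "shift_jis", "euc_kr"]
n = len(charsets)

pivot_tables = 2 * n          # one decoder and one encoder per charset
direct_tables = n * (n - 1)   # one table per ordered pair of distinct charsets

ebcdic_hello = "Hello".encode("cp037")
latin1_hello = convert(ebcdic_hello, "cp037", "latin-1")
```

For six charsets that is 12 tables with a pivot against 30 direct ones, and the gap widens quickly as charsets are added.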

About character conversion in general I would caution that there are
basically two types of mappings: Round-trip mappings for what's really the
same character on both sides, and fallbacks where you map to a different
but more or less similar/related character because that may be more
readable than a question mark or a replacement character. In a compiler, I
would hope that both unmappable characters and fallback mappings lead to
compiler errors, to avoid hidden surprises in runtime behavior.
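
As a rough illustration of why I would want the strict behavior, compare a strict conversion with a lenient one. Python's codecs have no built-in "similar character" fallback tables, so this sketch only shows the substitution case (a question mark); the function names are hypothetical.

```python
def encode_strict(text: str, execution_charset: str) -> bytes:
    """Round-trip mappings only: anything unmappable raises an error."""
    return text.encode(execution_charset, errors="strict")

def encode_lenient(text: str, execution_charset: str) -> bytes:
    """Substitute '?' for anything unmappable, silently changing the data."""
    return text.encode(execution_charset, errors="replace")

sample = "price: \u20ac5"  # U+20AC EURO SIGN is not in ISO 8859-1

lenient = encode_lenient(sample, "latin-1")

try:
    encode_strict(sample, "latin-1")
    strict_failed = False
except UnicodeEncodeError:
    # This is the point where a compiler would report a diagnostic.
    strict_failed = True
```

The lenient result compiles "successfully" but the program then carries a question mark where the author wrote a euro sign, which is exactly the kind of hidden runtime surprise a compiler error would prevent.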

This probably constrains what the compiler can and should do. As a
programmer, I want to be able to put any old byte sequence into a string
literal, including NUL, controls, and non-character-encoding bytes. (We use
string literals for more things than "text".) For example, when we didn't
yet have syntax for UTF-8 string literals, we could write unmarked literals
with \xhh sequences and pass them into functions that explicitly operated
on UTF-8, regardless of whether those byte sequences were well-formed
according to the source or execution charsets. This pretty much works only
if there is no conversion that puts limits on the contents.
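
The following sketch models that situation. A Python bytes literal stands in for an unmarked C++ literal written with \xhh escapes, and the "conversion" step is a hypothetical compiler that treats the literal as latin-1 source text and re-encodes it for a cp037 execution charset; both charset choices are assumptions for illustration.

```python
# An unmarked literal written with \xhh escapes: the UTF-8 bytes of U+20AC.
literal = b"\xe2\x82\xac"

def utf8_length(data: bytes) -> int:
    """A function that explicitly operates on UTF-8, whatever the charsets."""
    return len(data.decode("utf-8"))

# Passed through untouched, the bytes are still well-formed UTF-8.
untouched_length = utf8_length(literal)

# A hypothetical compiler that converts the literal between charsets
# (assumed: latin-1 source, cp037 execution) changes the byte values.
converted = literal.decode("latin-1").encode("cp037")
bytes_survived = (converted == literal)
```

After the conversion the bytes no longer encode the euro sign in UTF-8, so the function downstream would see different data than the programmer wrote.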

I believe that EBCDIC platforms have dealt with this, where necessary, by
using single-byte conversion mappings between EBCDIC-based and ASCII-based
codepages that were strict permutations. Thus, control codes and other byte
values would round-trip through any number of conversions back and forth.
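
To illustrate the permutation property, here is a sketch that checks it for one pair of single-byte code pages. It assumes Python's cp037 and latin-1 codecs as the EBCDIC-based and ASCII-based sides; the helper names are hypothetical, and other code page pairs may not behave this way.

```python
all_bytes = bytes(range(256))

def is_permutation(source: str, target: str) -> bool:
    """True if source -> target maps the 256 byte values one-to-one."""
    try:
        mapped = all_bytes.decode(source).encode(target)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False  # some byte value has no mapping at all
    return sorted(mapped) == list(all_bytes)

def round_trips(source: str, target: str, times: int = 3) -> bool:
    """Convert back and forth several times and compare with the input."""
    data = all_bytes
    for _ in range(times):
        data = data.decode(source).encode(target)
        data = data.decode(target).encode(source)
    return data == all_bytes

ebcdic_latin1_is_permutation = is_permutation("cp037", "latin-1")
ebcdic_latin1_round_trips = round_trips("cp037", "latin-1")
ebcdic_ascii_is_permutation = is_permutation("cp037", "ascii")
```

Because every byte value, including the control codes, has exactly one partner on the other side, no amount of converting back and forth loses information. The 7-bit ASCII target fails the check because half of the byte values have nowhere to go.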

PS: I know that this really goes beyond string literals: C++ identifiers
can include non-ASCII characters. I expect these to work much like regular
string literals, minus escape sequences. I guess that the execution charset
still plays a role for the linker symbol table.

Best regards,
markus