EBCDIC control characters
sdowney at gmail.com
Fri Jun 19 16:56:33 CDT 2020
On Fri, Jun 19, 2020 at 5:08 PM Markus Scherer via Unicode
<unicode at unicode.org> wrote:
> I would soften a bit what Ken and Asmus have said.
> Of course C++ compilers have to deal with a variety of charsets/codepages. There is (or used to be) a lot of code in various Windows/Mac/Linux/... codepages, including variations of Shift-JIS, EUC-KR, etc.
> My mental model of how compilers work (which might be outdated) is that they work within a charset family (usually ASCII, but EBCDIC on certain platforms) and mostly parse ASCII characters as is (and for the "basic character set" in EBCDIC, mostly assume the byte values of cp37 or 1047 depending on platform). For regular string literals, I expect it's mostly a pass-through from the source code (and \xhh bytes) to the output binary.
What you described is the standard model for C compilers. For better
or worse, the C++ model is much more complicated. Note that what I'm
about to describe isn't how actual compilers work, but how translation
is described in the C++ standard.
When translating a source file, every character outside the
'basic source character set' (the ASCII letters, digits, and some
necessary punctuation) is converted to a universal character name of
the form \unnnn or \Unnnnnnnn, where the n's spell out the short name
of the code point; surrogate pairs are excluded, so these are really
scalar values.
Later in translation, the universal character names and the basic
source character set elements are mapped to the execution character
set, whose values are determined by locale. That is terribly vague,
and we'd like to clean it up. There are wide literals to deal with, as
well as the newer Unicode literals, for which we've mandated the
encoding to be the UTF form of the appropriate code unit width, with
the distinct types char8_t, char16_t, and char32_t.
> But of course C++ has syntax for Unicode string literals. I think compilers basically call a system function to convert from the source bytes to Unicode, either with the process default charset or with an explicit one if specified on the command line.
> It would get more interesting if a compiler had options for different source and execution charsets. I don't know if they would convert regular string literals directly from one to the other, or if they convert everything to Unicode (like a Java compiler) and then to the execution charset. (In Java, the execution charset is UTF-16, so the problem space there is simpler.)
In practice, compilers behave sensibly and map directly from the
source to the destination encoding. In theory they triangulate via
code points. The difference can, of course, be made visible with
carefully chosen text in which a code point has multiple possible
destinations, but in practice users do not care, because they get the
results they expect. It's more a problem of specification.
> PS: I know that this really goes beyond string literals: C++ identifiers can include non-ASCII characters. I expect these to work much like regular string literals, minus escape sequences. I guess that the execution charset still plays a role for the linker symbol table.
Identifiers work in substantially the same way, although with
additional restrictions. I'm currently working on a proposal to apply
the current UAX #31 to C++, cleaning up the historical allow and block
lists. ( http://wg21.link/p1949 : C++ Identifier Syntax using Unicode
Standard Annex 31 )
I'll be posting some questions soon about that.