EBCDIC control characters
Ken Whistler
kenwhistler at sonic.net
Thu Jun 18 18:14:19 CDT 2020
On 6/18/2020 12:22 PM, Corentin wrote:
> The specific case that people are talking about is indeed string literals
> such as "\x06\u0086", where the hexadecimal escape is meant to be an
> EBCDIC character and the \uxxxx is meant to be a Unicode character,
> such that the hexadecimal sequence would map to that character, and
> whether, in that very odd scenario, they are the same character or not,
> and whether they should be distinguishable.
Well, with the caveat that I am not a formal language designer -- I just
use them on T.V.... ;-)
My opinion is that such constructs should simply be illegal and/or
non-syntactical. The whole idea of letting people import the complexity
of character set conversion (particularly when extended to the
incompatibility between EBCDIC and ASCII-based representation) into
string literals strikes me as just daft.
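To make the construct concrete, here is a minimal sketch (C++ syntax;
the CP 500 target and its mapping of byte 0x06 to U+0086 are assumed
for illustration):

    // Hedged sketch: assumes the literal (execution) encoding is
    // EBCDIC CP 500, in which byte 0x06 corresponds to U+0086.
    //
    //   "\x06"   -- raw byte escape: the EBCDIC byte 0x06, as-is.
    //   "\u0086" -- Unicode escape: U+0086, converted to the literal
    //               encoding, i.e. also byte 0x06 on this target.
    const char s[] = "\x06\u0086";
    // On the EBCDIC target both elements come out as the byte 0x06,
    // so the two spellings are indistinguishable in the object code;
    // on a UTF-8 target "\x06" stays one byte while "\u0086" becomes
    // the two bytes 0xC2 0x86.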
If program text is to be interpreted and compiled in an EBCDIC
environment, any string literals contained in that source text should be
constrained to EBCDIC, period, full stop. (0x4B, 0x4B) And if they
contain more than the very restricted EBCDIC set of A..Z, a..z, 0..9 and
a few common punctuation marks, then it had better all be in one well-supported
EBCDIC extended code page such as CP 500.
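If a toolchain wanted to enforce that constraint, a minimal sketch
(reusing the 0x4B value for '.' cited above) is a compile-time check
on the execution character set:

    // Hedged sketch: fail the build unless the execution (literal)
    // character set is EBCDIC, where '.' encodes as byte 0x4B.
    static_assert('.' == 0x4B,
                  "string literals in this file assume an EBCDIC "
                  "execution character set such as CP 500");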
If program text is to be interpreted and compiled in a Unicode
environment, any string literals contained in that source text should be
constrained to Unicode, period, full stop (U+002E, U+002E). And for
basic, 8-bit char strings, it had better all be UTF-8 these days. UTF-16 and
UTF-32 also work, of course, but IMO, support for those is best handled
by depending on libraries such as ICU, rather than expecting that the
programming language and runtime libraries are going to support them as
well as char* UTF-8 strings.
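As a hedged illustration of keeping plain char-level text in UTF-8
(assuming C++20's u8 literals; the contents are arbitrary):

    // Sketch: a UTF-8 literal; U+00EF is two bytes in UTF-8, so byte
    // length and character count differ -- exactly the kind of detail
    // a library like ICU handles when you need more than raw bytes.
    const char8_t greeting[] = u8"na\u00EFve";
    static_assert(sizeof(greeting) == 7,
                  "n a 0xC3 0xAF v e + NUL == 7 bytes");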
If program source text has to be cross-compiled in both an EBCDIC and a
Unicode environment, the only sane approach is to extract all but the bare
minimum of string literals to various kinds of resource files which can
then be independently manipulated and pushed through character
conversions, as needed -- not expecting that the *compiler* is going to
suddenly get smart and do the right thing every time it encounters some
otherwise untagged string literal sitting in program text. That's a
whole lot cleaner than doing a whole bunch of conditional compilation
and working with string literals in program text that are always going
to be half-gibberish on whichever platform you view them on for maintenance.
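A minimal sketch of that extraction approach (the resource file name,
its one-pair-per-line format, and load_message() are all hypothetical):

    #include <fstream>
    #include <string>

    // Sketch: no raw text in the source; the EBCDIC and UTF-8 builds
    // each ship a messages.res that was converted offline by ordinary
    // character-set tools.
    std::string load_message(const std::string& key) {
        std::ifstream in("messages.res");  // per-platform resource file
        std::string k, v;
        while (in >> k && std::getline(in, v))
            if (k == key) return v;        // rest of the line is the text
        return {};
    }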
I had to do some EBCDIC/ASCII cross-compiled code development once --
although admittedly 20 years ago. It wasn't pretty.
>
> Our current model is source encoding -> Unicode -> literal encoding,
> all three encodings being potentially distinct,
> so we do in fact "mess with" string literals, and the question is
> whether or not going through Unicode should ever be considered destructive,
Answer, no. If somebody these days is trying to do software development
work in a one-off, niche character encoding that cannot be fully
converted to Unicode, then *they* are daft.
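Stated as code, the losslessness at stake is roughly this (a hedged
sketch; the two table entries reuse the CP 500 values cited above,
where a real converter carries all 256):

    #include <cassert>
    #include <map>

    int main() {
        // EBCDIC CP 500 byte -> Unicode scalar value (fragment only).
        std::map<unsigned char, char32_t> to_unicode = {
            {0x4B, U'.'},        // '.' is 0x4B in EBCDIC, per above
            {0x06, U'\u0086'},   // the control character from the example
        };
        // Build the inverse mapping and check the round trip:
        // byte -> Unicode -> byte recovers every byte, so routing
        // literals through Unicode loses nothing.
        std::map<char32_t, unsigned char> from_unicode;
        for (auto [byte, cp] : to_unicode)
            from_unicode[cp] = byte;
        for (auto [byte, cp] : to_unicode)
            assert(from_unicode[cp] == byte);
    }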
> and my argument is that it is never destructive, because it is
> semantically preserving in all the relevant use cases.
>
> The question was in particular whether we should use "a superset of
> Unicode" instead of "Unicode" in that intermediate step.
Answer, no. That will cause you nothing but trouble going forward.
All my opinions, of course. YMMV. But probably not by a lot. ;-)
--Ken