EBCDIC control characters
Ken Whistler
kenwhistler at sonic.net
Thu Jun 18 18:14:19 CDT 2020
On 6/18/2020 12:22 PM, Corentin wrote:
> The specific case that people are talking about is indeed string literals
> such as "\x06\u0086", where the hexadecimal escape is meant to be an
> EBCDIC character and the \uxxxx is meant to be a Unicode character,
> such that the hexadecimal sequence would map to that character, and
> whether, in that very odd scenario, they are the same character or not,
> and whether they should be distinguishable.
Well, with the caveat that I am not a formal language designer -- I just
use them on T.V.... ;-)
My opinion is that such constructs should simply be illegal and/or
non-syntactical. The whole idea of letting people import the complexity
of character set conversion (particularly when extended to the
incompatibility between EBCDIC and ASCII-based representation) into
string literals strikes me as just daft.
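To make the construct concrete, here is a minimal sketch (C++ syntax;
the CP 500 target and its mapping of byte 0x06 to U+0086 are assumed
for illustration):

    // Hedged sketch: assumes the literal (execution) encoding is
    // EBCDIC CP 500, in which byte 0x06 corresponds to U+0086.
    //
    //   "\x06"   -- raw byte escape: the EBCDIC byte 0x06, as-is.
    //   "\u0086" -- Unicode escape: U+0086, converted to the literal
    //               encoding, i.e. also byte 0x06 on this target.
    const char s[] = "\x06\u0086";
    // On the EBCDIC target both elements come out as the byte 0x06,
    // so the two spellings are indistinguishable in the object code;
    // on a UTF-8 target "\x06" stays one byte while "\u0086" becomes
    // the two bytes 0xC2 0x86.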
If program text is to be interpreted and compiled in an EBCDIC
environment, any string literals contained in that source text should be
constrained to EBCDIC, period, full stop. (0x4B, 0x4B) And if they
contain more than the very restricted EBCDIC set of A..Z, a..z, 0..9 and
a few common punctuation marks, then it had better all be in one well-supported
EBCDIC extended code page such as CP 500.
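If a toolchain wanted to enforce that constraint, a minimal sketch
(reusing the 0x4B value for '.' cited above) is a compile-time check
on the execution character set:

    // Hedged sketch: fail the build unless the execution (literal)
    // character set is EBCDIC, where '.' encodes as byte 0x4B.
    static_assert('.' == 0x4B,
                  "string literals in this file assume an EBCDIC "
                  "execution character set such as CP 500");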
If program text is to be interpreted and compiled in a Unicode
environment, any string literals contained in that source text should be
constrained to Unicode, period, full stop (U+002E, U+002E). And for
basic, 8-bit char strings, it had better all be UTF-8 these days. UTF-16 and
UTF-32 also work, of course, but IMO, support for those is best handled
by depending on libraries such as ICU, rather than expecting that the
programming language and runtime libraries are going to support them as
well as char* UTF-8 strings.
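As a hedged illustration of keeping plain char-level text in UTF-8
(assuming C++20's u8 literals; the contents are arbitrary):

    // Sketch: a UTF-8 literal; U+00EF is two bytes in UTF-8, so byte
    // length and character count differ -- exactly the kind of detail
    // a library like ICU handles when you need more than raw bytes.
    const char8_t greeting[] = u8"na\u00EFve";
    static_assert(sizeof(greeting) == 7,
                  "n a 0xC3 0xAF v e + NUL == 7 bytes");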
If program source text has to be cross-compiled in both an EBCDIC and a
Unicode environment, the only sane approach is to extract all but the bare
minimum of string literals to various kinds of resource files which can
then be independently manipulated and pushed through character
conversions, as needed -- not expecting that the *compiler* is going to
suddenly get smart and do the right thing every time it encounters some
otherwise untagged string literal sitting in program text. That's a
whole lot cleaner than doing a whole bunch of conditional compilation
and working with string literals in program text that are always going
to be half-gibberish on whichever platform you view them on for maintenance.
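A minimal sketch of that extraction approach (the resource file name,
its one-pair-per-line format, and load_message() are all hypothetical):

    #include <fstream>
    #include <string>

    // Sketch: no raw text in the source; the EBCDIC and UTF-8 builds
    // each ship a messages.res that was converted offline by ordinary
    // character-set tools.
    std::string load_message(const std::string& key) {
        std::ifstream in("messages.res");  // per-platform resource file
        std::string k, v;
        while (in >> k && std::getline(in, v))
            if (k == key) return v;        // rest of the line is the text
        return {};
    }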
I had to do some EBCDIC/ASCII cross-compiled code development once --
although admittedly 20 years ago. It wasn't pretty.
>
> Our current model is source encoding -> Unicode -> literal encoding,
> all three encodings being potentially distinct,
> so we do in fact "mess with" string literals, and the question is
> whether or not going through Unicode should ever be considered destructive,
Answer, no. If somebody these days is trying to do software development
work in a one-off, niche character encoding that cannot be fully
converted to Unicode, then *they* are daft.
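Stated as code, the losslessness at stake is roughly this (a hedged
sketch; the two table entries reuse the CP 500 values cited above,
where a real converter carries all 256):

    #include <cassert>
    #include <map>

    int main() {
        // EBCDIC CP 500 byte -> Unicode scalar value (fragment only).
        std::map<unsigned char, char32_t> to_unicode = {
            {0x4B, U'.'},        // '.' is 0x4B in EBCDIC, per above
            {0x06, U'\u0086'},   // the control character from the example
        };
        // Build the inverse mapping and check the round trip:
        // byte -> Unicode -> byte recovers every byte, so routing
        // literals through Unicode loses nothing.
        std::map<char32_t, unsigned char> from_unicode;
        for (auto [byte, cp] : to_unicode)
            from_unicode[cp] = byte;
        for (auto [byte, cp] : to_unicode)
            assert(from_unicode[cp] == byte);
    }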
> and my argument is that it is never destructive, because it is
> semantically preserving in all the relevant use cases.
>
> The question was in particular whether we should use "a superset of
> Unicode" instead of "Unicode" in that intermediate step.
Answer, no. That will cause you nothing but trouble going forward.
All my opinions, of course. YMMV. But probably not by a lot. ;-)
--Ken