EBCDIC control characters

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jun 20 06:09:04 CDT 2020


On Sat, 20 Jun 2020 10:57:10 +0200
Corentin via Unicode <unicode at unicode.org> wrote:

> On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> <unicode at unicode.org> wrote:
> 
> >
> > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:  
> > > Isn't there still the issue of supporting U+0000 in C-type
> > > strings?  
> >
> > I don't see why. And it has nothing to do with Unicode per se,
> > anyway.
> >
> > That is just a transform of the question of "the issue of
> > supporting 0x00 in
> > C-type strings restricted to ASCII."
> >
> > The issue is precisely the same, and the solutions are precisely
> > the same -- by design.
> >  
> 
> I'm not sure I understand that issue; could you clarify?
> In both C and C++, U+0000 is interpreted as the null character
> (which, depending on context, marks the end of the string), which is
> the same behaviour as the equivalent ASCII character.

One immediate consequence of that assertion is that one cannot, in
general, store a line of Unicode text in a 'string'.  There have been
Unicode test cases that deliberately include a null in the middle of
the text; if a program thinks it has stored such a line in a 'string',
it will fail the test, because the null character and everything after
it are not treated as part of the text.

One of the early tricks for storing arbitrary character sequences in
strings was to use non-shortest-form UTF-8 encodings, so that certain
bytes would not be interpreted as control characters with undesired
characteristics.  This form of UTF-8 is now invalid.  Java is
especially noted for using the overlong encoding <C0, 80> to represent
zero bytes in the modified UTF-8 of its class files.

My guess is that Ken is alluding to storing arbitrary text not in
strings, but in arrays of code units accompanied by explicit length
parameters.

Richard.




