EBCDIC control characters

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jun 20 09:32:19 CDT 2020


On Sat, 20 Jun 2020 14:11:15 +0200
Corentin via Unicode <unicode at unicode.org> wrote:

> On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:  
> 
> > On Sat, 20 Jun 2020 10:57:10 +0200
> > Corentin via Unicode <unicode at unicode.org> wrote:
> >  
> > > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > > <unicode at unicode.org> wrote:
> > >  
> > > >
> > > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:  
> > > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > > strings?  
> > > >
> > > > I don't see why. And it has nothing to do with Unicode per se,
> > > > anyway.
> > > >
> > > > That is just a transform of the question of "the issue of
> > > > supporting 0x00 in
> > > > C-type strings restricted to ASCII."
> > > >
> > > > The issue is precisely the same, and the solutions are precisely
> > > > the same -- by design.
> > > >  
> > >
> > > I'm not sure I understand that issue, could you clarify?
> > > In both C and C++, U+0000 is interpreted as the null character
> > > (which marks the end of the string, depending on context), which
> > > is the same behavior as the equivalent
> > > ASCII character.
> >
> > One immediate consequence of that assertion is that one cannot in
> > general store a line of Unicode text in a 'string'.  There have been
> > Unicode test cases that deliberately include a null in the middle of
> > the text, and if the program thinks it has stored the line in a
> > 'string', it will fail the test, because the null character and
> > beyond are not part of the text being interpreted.
> >
> > One of the early tricks to store general character sequences in
> > strings was to use non-shortest form UTF-8 encodings to avoid
> > characters being interpreted as control characters with undesired
> > characteristics.  This form of UTF-8 is now invalid.  Java was
> > especially noted for using the encoding <C0, 80> to store zero
> > bytes in byte code in UTF-8.
> >
> > My guess is that Ken is alluding to not storing arbitrary text in
> > strings, but rather in arrays of code units along with appropriate
> > length parameters.
> >  
> 
> Oh, yes, I see thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?

I believe Unicode doesn't define its semantics, but rather defers by
default to ECMA-48.  Of NUL it says, "NUL is used for media-fill or
time-fill. NUL characters may be inserted into, or removed from, a data
stream without affecting the information content of that stream, but
such action may affect the information layout and/or the control of
equipment."
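
As a rough illustration of that "may be removed" clause, here is a
minimal C sketch of a filter that drops NUL fill from a buffer.  The
name strip_nuls is made up for this example, and the buffer carries an
explicit length, since a NUL-terminated string could not hold the fill
in the first place.

```c
#include <stddef.h>

/* Hypothetical helper: remove every NUL byte from buf[0..len) in place.
 * Returns the number of bytes that remain at the start of buf. */
size_t strip_nuls(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] != '\0')
            buf[out++] = buf[in];   /* keep non-NUL bytes, in order */
    }
    return out;
}
```

Whether dropping the fill is acceptable depends on the consumer: per
the ECMA-48 wording it leaves the information content alone but may
change layout or device timing.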

I have used it for easy composition of Fortran output lines from
CHARACTER variables; the NULs in the resulting lines were simply
ignored when the output was displayed on a terminal or line printer.
The Fortran 90 intrinsic function TRIM provided an easier and more
reliable way of doing the same job; embedded NULs don't play well with
C.
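
To make the C problem concrete, here is a small sketch contrasting a
NUL-terminated view of a line with a length-carrying one.  The type
text_span and both function names are invented for this illustration;
they are not from any library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying view of text: bytes plus an explicit count. */
struct text_span {
    const char *data;
    size_t      len;    /* number of bytes, embedded NULs included */
};

/* Length as the C string functions see it: stops at the first NUL. */
size_t c_string_len(const char *s)
{
    return strlen(s);
}

/* Count NUL bytes inside a span.  This only works because the length
 * is stored separately instead of being implied by a terminator. */
size_t span_count_nuls(struct text_span t)
{
    size_t n = 0;
    for (size_t i = 0; i < t.len; i++) {
        if (t.data[i] == '\0')
            n++;
    }
    return n;
}
```

For the five bytes 'A', 'B', NUL, 'C', 'D', c_string_len reports 2,
whereas a text_span with len set to 5 still reaches the 'C' and 'D'
and span_count_nuls finds the one embedded NUL.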

Richard.

