torque+mpiexec+mvapich = strange behavior

Anton Starikov A.Starikov at utwente.nl
Mon Dec 6 10:10:43 EST 2004


> There is only one process when prepare_ib_startup_port is called, but
> the code soon forks after that.  There is no difference in mpiexec for
> interactive or non-interactive.
I thought the same, but I added a printf after the line
mport_fd = socket(PF_INET, SOCK_STREAM, 0);
in prepare_ib_startup_port, and it is hit twice for
mpiexec -np 1 bla-bla-bla

Second, it is not even a change of shell.
It is just different behavior in torque.
If the script starts with #! it is treated as non-executable,
and when it starts with #!/path/to/shell it is executed.

So, basically, in the first case they invoke a shell and feed it the 
lines of the script (via a pipe, more or less). That is the equivalent of
"cat my_job_script | bash"

In the second case something like "/bin/bash my_job_script" is executed.
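
One way to check which of the two cases a job is in would be to look at
what descriptor 0 refers to from inside a program started by the script,
since commands run by the shell inherit its stdin. Below is a minimal
sketch, assuming a POSIX system. fd_kind is a hypothetical helper of mine,
not something from mpiexec or torque, and I am only assuming that torque
feeds the script through a real pipe in the first case.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not part of mpiexec or torque):
 * return a short description of what descriptor fd refers to. */
const char *fd_kind(int fd)
{
    struct stat st;

    /* fstat fails with EBADF when fd is not an open descriptor */
    if (fstat(fd, &st) < 0)
        return "closed or invalid";
    if (S_ISFIFO(st.st_mode))
        return "pipe";
    if (S_ISREG(st.st_mode))
        return "regular file";
    if (S_ISCHR(st.st_mode))
        return isatty(fd) ? "tty" : "character device";
    return "other";
}
```

Printing fd_kind(0) to stderr at the top of main should then show whether
stdin is a pipe (as I would expect with "cat my_job_script | bash") or
something else.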

That is the only difference.
I tried both bash and sh in every combination, and the result is that 
correct execution depends only on the first line of the script. It is 
irrelevant which shell you use. It seems that only the way the process 
is executed matters. If you show torque that the script is executable, 
it works. If not, it doesn't.

I'll try your suggestions.

Anton

> Because of your observation that changing your shell changes the
> behavior and your feelings about the fork :), I'm worried about a
> certain close in stdio.c.  The parent does close(0) unconditionally, but
> perhaps this is not correct.  Can you add, around stdio.c:331, the two
> debugging printfs below (untested):
> 
>     if (pid > 0) {
>         /* parent: do not listen to stdin but
>          * leave 1,2 open for debugging/error output (to pbs batch output files
>          * or to tty for interactive)
>          */
> 	printf("%s: pre-close-0 aggregate = %d %d %d\n", __func__,
> 	  aggregate[0], aggregate[1], aggregate[2]);
> 	printf("%s: abort_fd_in = %d %d\n", __func__,
> 	  abort_fd_in[0], abort_fd_in[1]);
>         close(0);
> 
> If we see that abort_fd_in[0] == 0, and aggregate[0] == -1, maybe we
> should pay attention to those instead of running straight to close(0).
> Or it could be a completely different problem.
> 
> What is your default shell, by the way, if not /bin/bash?  Do you
> specify any "-S" lines in your PBS script or on the command-line to
> qsub?
> 
> 		-- Pete
> 
> 
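
To illustrate why abort_fd_in[0] == 0 would be a problem: POSIX hands out
the lowest free descriptor numbers, so if fd 0 is already closed when
pipe() is called, the read end of the new pipe becomes descriptor 0, and a
later unconditional close(0) closes that pipe instead of stdin. Below is a
minimal sketch of this effect. The function name is hypothetical, and I
have not verified that this is what actually happens inside mpiexec.

```c
#include <unistd.h>

/* Hypothetical demo (not mpiexec code): close fd 0, create a pipe, and
 * report which descriptor the read end received. stdin is restored
 * before returning if it was open on entry. */
int pipe_read_fd_when_stdin_closed(void)
{
    int saved = dup(0);   /* -1 if stdin was already closed */
    int fds[2];
    int result = -1;

    close(0);             /* descriptor 0 is now free */
    if (pipe(fds) == 0) {
        result = fds[0];  /* lowest free descriptor, so 0 */
        close(fds[0]);
        close(fds[1]);
    }
    if (saved >= 0) {     /* put the original stdin back on fd 0 */
        dup2(saved, 0);
        close(saved);
    }
    return result;
}
```

If this returns 0, the pipe's read end landed on descriptor 0, which is the
situation where close(0) in the parent would do the wrong thing.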