Problems with mpiexec

Pete Wyckoff pw at osc.edu
Fri May 9 11:48:15 EDT 2003


bclem at rice.edu said on Thu, 08 May 2003 08:41 -0500:
> I run the command ./mpiexec ./hello  (using the hello program from the
> mpiexec src)
> 
> Well it will just pause and then finally give the following output
> 
> mpiexec: Warning: main: task 0 died with signal 9.
> mpiexec: Warning: main: task 1 died with signal 9.
> mpiexec: Warning: main: task 2 died with signal 9.
> mpiexec: Warning: main: task 3 died with signal 9.
>  
>  
> mpiexec: Warning: main: task 99 died with signal 9.
> 
> 
> I've run the mpiexec command with the -n parameter like this:
> ./mpiexec -n 40 ./hello (it works)
> ./mpiexec -n 80 ./hello (it works)
> 
> but when I run ./mpiexec -n 98 (it dies)
> 
> Here is the output from that command where it dies using the -v flag
> 
> wait_tasks: numspawned = 99, got evt 199 for tid 1225 host n40 status 267
> wait_tasks: task 98 tid 1225 stray obit 0 while waiting for kill 199

These last lines are not so bad.  They mean that mpiexec realized that
the parallel job was dying unhappily, so it sent "kill -9" to the
rest of the tasks.  The "stray obit" can happen because of the race
between a process dying on its own and getting killed by mpiexec.
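
As a rough local illustration of that race, here is a sketch using plain
fork and kill on one machine.  It is not mpiexec's own code, and the
helper names spawn and kill_and_reap are made up for this example.  A
task that has already exited on its own still accepts the kill, but its
obituary reports its own exit status rather than signal 9.

```python
import os
import signal
import time


def spawn(lifetime_s, exit_code):
    """Fork a child that sleeps for lifetime_s seconds, then exits with exit_code."""
    pid = os.fork()
    if pid == 0:
        try:
            time.sleep(lifetime_s)
        finally:
            os._exit(exit_code)  # never return into the parent's code
    return pid


def kill_and_reap(pids):
    """Send SIGKILL to every pid, then report how each child actually ended."""
    for pid in pids:
        # Succeeds even for a child that already exited but is not yet reaped.
        os.kill(pid, signal.SIGKILL)
    results = {}
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if os.WIFSIGNALED(status):
            results[pid] = ("signal", os.WTERMSIG(status))
        else:
            results[pid] = ("exit", os.WEXITSTATUS(status))
    return results


if __name__ == "__main__":
    failed = spawn(0, 1)     # dies on its own right away
    healthy = spawn(30, 0)   # still running when the kill arrives
    time.sleep(1)            # give the first child time to exit
    for pid, how in kill_and_reap([failed, healthy]).items():
        print(pid, how)
```

In this sketch the first task has normally exited before the SIGKILL is
sent, so only the second one is reported as dying with signal 9.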

But why does it decide that your parallel job must die?  Usually it will
say something when this happens.  Can you run with lots of -v (say
three) and send me the output?  The rest of the list may not care about
that.  Also take a look at the "alarm(10)" in hello.c: each task gives
up if it does not manage to complete initialization within those 10
seconds.  Perhaps things are just very slow?
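
hello.c itself is not quoted in this thread, so the following is only a
sketch of the watchdog pattern that alarm(10) suggests, written in
Python rather than C.  The function run_with_alarm and the sleep that
stands in for initialization are assumptions made for illustration.  It
shows that a task whose startup outlasts the alarm is terminated by
SIGALRM, while a fast one exits normally.

```python
import os
import signal
import time


def run_with_alarm(limit_s, init_s):
    """Run a simulated init of init_s seconds in a child armed with alarm(limit_s).

    Returns the number of the signal that killed the child, or 0 if it
    exited without being signalled.
    """
    pid = os.fork()
    if pid == 0:
        try:
            # Make sure SIGALRM has its default action: terminate the process.
            signal.signal(signal.SIGALRM, signal.SIG_DFL)
            signal.alarm(limit_s)   # arm the watchdog
            time.sleep(init_s)      # hypothetical stand-in for slow startup
            signal.alarm(0)         # init finished in time: cancel the alarm
            os._exit(0)
        finally:
            os._exit(1)             # reached only if something above raised
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


if __name__ == "__main__":
    print("fast init, killed by signal:", run_with_alarm(2, 0))
    print("slow init, killed by signal:", run_with_alarm(1, 3))
```

If startup on a large job really is slow, raising the limit in a test
program like this is a quick way to check whether the timeout is what
ends the tasks.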

I may ask you for some strace output of mpiexec later if the verbose
logging is not enlightening.

		-- Pete


