Problems with mpiexec
Brent Clements
bclem at rice.edu
Fri May 9 11:50:39 EDT 2003
Pete, I figured out what it was. We are having problems with the PCI
buses on a few of the machines in our cluster. The Myrinet cards all
show up as working, but when you drill down and try to run something on
the cards, they fail. Debugging further and harder made us
realize it was the PCI buses on a few of those machines.
Thanks for the help though!!
-Brent
On Fri, 2003-05-09 at 10:48, Pete Wyckoff wrote:
> bclem at rice.edu said on Thu, 08 May 2003 08:41 -0500:
> > I run the command ./mpiexec ./hello (using the hello program from the
> > mpiexec src)
> >
> > Well it will just pause and then finally give the following output
> >
> > mpiexec: Warning: main: task 0 died with signal 9.
> > mpiexec: Warning: main: task 1 died with signal 9.
> > mpiexec: Warning: main: task 2 died with signal 9.
> > mpiexec: Warning: main: task 3 died with signal 9.
> >
> >
> > mpiexec: Warning: main: task 99 died with signal 9.
> >
> >
> > I've run the mpiexec command with the -n parameter like this
> > ./mpiexec -n 40 ./hello (it works)
> > ./mpiexec -n 80 ./hello (it works)
> >
> > but when I run ./mpiexec -n 98 (it dies)
> >
> > Here is the output from that command where it dies using the -v flag
> >
> > wait_tasks: numspawned = 99, got evt 199 for tid 1225 host n40 status 267
> > wait_tasks: task 98 tid 1225 stray obit 0 while waiting for kill 199
>
> These last lines are not so bad. It means that mpiexec realized that
> the parallel process was dying unhappily so it sent "kill -9" to the
> rest of the tasks. The "stray obit" can happen due to the race
> condition of the process dying versus getting killed by mpiexec.
>
> But why does it decide that your parallel job must die? Usually it will
> say something when this happens. Can you run with lots of -v (say
> three) and send me the output? The rest of the list may not care about
> that. Also take a look at the "alarm(10)" in hello.c---each task gives
> up if it does not manage to complete the initialization within that
> time. Perhaps things are just very slow?
>
> I may ask you for some strace output of mpiexec later if the verbose
> logging is not enlightening.
>
> -- Pete
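
For anyone reading this thread in the archive: the "alarm(10)" guard Pete
mentions follows a common pattern, which is to arm a timer before
initialization and give up if it fires first. The sketch below shows that
pattern in Python rather than C. It is not the code from hello.c, and the
names run_with_init_deadline, InitTimeout, and _on_alarm are made up for
illustration. It assumes a Unix system and the main thread, since it relies
on SIGALRM. Note that in C, if no handler is installed, the default action
for SIGALRM is to terminate the process; whether hello.c installs a handler
is not stated in this thread.

```python
import signal
import time


class InitTimeout(Exception):
    """Raised by the SIGALRM handler when the deadline passes (hypothetical name)."""


def _on_alarm(signum, frame):
    # Runs in the main thread when the timer armed by signal.alarm() expires.
    raise InitTimeout("initialization did not finish in time")


def run_with_init_deadline(init, seconds):
    """Call init(); return True if it finished within `seconds`, else False.

    `init` stands in for whatever startup work a task must complete
    (an assumption for this sketch, e.g. connecting to its peers).
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)          # arm the timer, like alarm(10) in C
    try:
        init()
        return True
    except InitTimeout:
        return False
    finally:
        signal.alarm(0)            # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)


if __name__ == "__main__":
    # A fast init completes; a slow one is cut off by the alarm.
    print(run_with_init_deadline(lambda: None, 1))
    print(run_with_init_deadline(lambda: time.sleep(3), 1))
```

If initialization on a large job is merely slow rather than broken, a guard
like this fires even though nothing has crashed, which is why Pete asks
whether things might just be very slow.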