Problems with mpiexec

Charland, Denis Denis.Charland at cnrc-nrc.gc.ca
Fri May 9 13:31:18 EDT 2003


Brent,

Are you using riser cards? I had similar problems with Myrinet cards on a
few nodes in our cluster. I eventually found that the problem was the contact
between the riser card edge connector and the PCI connector on the
motherboard. I tried cleaning the edge connector with a pencil eraser,
without success. The problem was in the PCI connector (oxidation, I suppose).
Since it was not really possible to clean the contacts, I removed and
reinserted the riser card in the PCI connector a dozen times. That solved
the problem.

Denis

-----Original Message-----
From: Brent Clements [mailto:bclem at rice.edu]
Sent: Friday, May 09, 2003 11:51 AM
To: Pete Wyckoff
Cc: 
Subject: Re: Problems with mpiexec


Pete, I figured out what it was. We are having problems with the PCI
buses on a few of the machines in our cluster. The Myrinet cards all
show up as working, but when you drill down and try to run something on
the cards, they fail. Debugging further and harder made us realize it
was the PCI buses on a few of those machines.

Thanks for the help though!!

-Brent



On Fri, 2003-05-09 at 10:48, Pete Wyckoff wrote:
> bclem at rice.edu said on Thu, 08 May 2003 08:41 -0500:
> > I run the command ./mpiexec ./hello  (using the hello program from the
> > mpiexec src)
> > 
> > Well it will just pause and then finally give the following output
> > 
> > mpiexec: Warning: main: task 0 died with signal 9.
> > mpiexec: Warning: main: task 1 died with signal 9.
> > mpiexec: Warning: main: task 2 died with signal 9.
> > mpiexec: Warning: main: task 3 died with signal 9.
> >  
> >  
> > mpiexec: Warning: main: task 99 died with signal 9.
> > 
> > 
> > I've run the mpiexec command with the -n parameter like this
> > ./mpiexec -n 40 ./hello (it works)
> > ./mpiexec -n 80 ./hello (it works)
> > 
> > but when I run ./mpiexec -n 98 (it dies)
> > 
> > Here is the output from that command where it dies using the -v flag
> > 
> > wait_tasks: numspawned = 99, got evt 199 for tid 1225 host n40 status 267
> > wait_tasks: task 98 tid 1225 stray obit 0 while waiting for kill 199
> 
> These last lines are not so bad.  It means that mpiexec realized that
> the parallel process was dying unhappily so it sent "kill -9" to the
> rest of the tasks.  The "stray obit" can happen due to the race
> condition of the process dying versus getting killed by mpiexec.
> 
> But why does it decide that your parallel job must die?  Usually it will
> say something when this happens.  Can you run with lots of -v (say
> three) and send me the output?  The rest of the list may not care about
> that.  Also take a look at the "alarm(10)" in hello.c---each task gives
> up if it does not manage to complete the initialization within that
> time.  Perhaps things are just very slow?
> 
> I may ask you for some strace output of mpiexec later if the verbose
> logging is not enlightening.
> 
> 		-- Pete
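
For reference, the "alarm(10)" guard Pete mentions can be sketched roughly
as below. This is not the code from hello.c, only a minimal illustration of
the pattern: a process arms an alarm before initialization and is killed by
SIGALRM if initialization does not finish in time. The names slow_init and
run_with_watchdog are hypothetical, and the sleep merely stands in for a
slow startup; a parent process plays the role of the launcher and reports
which signal, if any, ended the child.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for a slow initialization step: just sleeps. */
static void slow_init(unsigned init_seconds)
{
    struct timespec ts = { (time_t)init_seconds, 0 };
    nanosleep(&ts, NULL);
}

/*
 * Fork a child that arms alarm(limit_seconds), runs slow_init(), and then
 * cancels the alarm. Returns the number of the signal that killed the
 * child, 0 if the child exited on its own, or -1 on fork/waitpid failure.
 */
int run_with_watchdog(unsigned limit_seconds, unsigned init_seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the default action for SIGALRM terminates the process. */
        signal(SIGALRM, SIG_DFL);
        alarm(limit_seconds);     /* give up if init takes too long */
        slow_init(init_seconds);
        alarm(0);                 /* init finished in time: cancel the alarm */
        _exit(0);
    }

    /* Parent: wait for the child and decode how it ended. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

Calling run_with_watchdog(10, 30) would model a task whose startup takes
longer than the 10-second limit; the return value would then be SIGALRM
rather than 0.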


_______________________________________________
mpiexec mailing list
mpiexec at osc.edu
http://email.osc.edu/mailman/listinfo/mpiexec


