Problems with mpiexec

Brent Clements bclem at rice.edu
Fri May 9 13:34:25 EDT 2003


Denis,

These machines should not have oxidation already, since we just purchased
them (HP Itanium zx6000s). The HP service rep is coming out today
to replace the motherboards. Hopefully that will fix the problem.

Thanks for the help. I'll keep your suggestion in mind the next time
this happens.

Thanks,
Brent

On Fri, 2003-05-09 at 12:31, Charland, Denis wrote:
> 
> Brent,
> 
> Are you using riser cards? I had similar problems with Myrinet cards on a
> few nodes in our cluster. I finally found that the problem was the contact
> between the riser card edge connector and the PCI connector on the
> motherboard. I tried to clean the edge connector with a pencil eraser
> without success. The problem was in the PCI connector (oxidation I suppose).
> Since it was not really possible to clean the contacts, I removed and
> reinserted the riser card in the PCI connector a dozen times. It solved
> the problem.
> 
> Denis
> 
> -----Original Message-----
> From: Brent Clements [mailto:bclem at rice.edu]
> Sent: Friday, May 09, 2003 11:51 AM
> To: Pete Wyckoff
> Cc: 
> Subject: Re: Problems with mpiexec
> 
> 
> Pete, I figured out what it was. We are having problems with the PCI
> buses on a few of the machines in our cluster. The Myrinet cards all
> show up as working, but when you drill down and try to run something on
> the cards, they fail. Debugging further and harder made us
> realize it was the PCI buses on a few of those machines.
> 
> Thanks for the help though!!
> 
> -Brent
> 
> 
> 
> On Fri, 2003-05-09 at 10:48, Pete Wyckoff wrote:
> > bclem at rice.edu said on Thu, 08 May 2003 08:41 -0500:
> > > I run the command ./mpiexec ./hello  (using the hello program from the
> > > mpiexec src)
> > > 
> > > Well it will just pause and then finally give the following output
> > > 
> > > mpiexec: Warning: main: task 0 died with signal 9.
> > > mpiexec: Warning: main: task 1 died with signal 9.
> > > mpiexec: Warning: main: task 2 died with signal 9.
> > > mpiexec: Warning: main: task 3 died with signal 9.
> > >  
> > >  
> > > mpiexec: Warning: main: task 99 died with signal 9.
> > > 
> > > 
> > > I've run the mpiexec command with the -n parameter like this
> > > > ./mpiexec -n 40 ./hello (it works)
> > > > ./mpiexec -n 80 ./hello (it works)
> > > 
> > > but when I run ./mpiexec -n 98 (it dies)
> > > 
> > > Here is the output from that command where it dies using the -v flag
> > > 
> > > > wait_tasks: numspawned = 99, got evt 199 for tid 1225 host n40 status 267
> > > wait_tasks: task 98 tid 1225 stray obit 0 while waiting for kill 199
> > 
> > These last lines are not so bad.  It means that mpiexec realized that
> > the parallel process was dying unhappily so it sent "kill -9" to the
> > rest of the tasks.  The "stray obit" can happen due to the race
> > condition of the process dying versus getting killed by mpiexec.
> > 
> > But why does it decide that your parallel job must die?  Usually it will
> > say something when this happens.  Can you run with lots of -v (say
> > three) and send me the output?  The rest of the list may not care about
> > that.  Also take a look at the "alarm(10)" in hello.c---each task gives
> > up if it does not manage to complete the initialization within that
> > time.  Perhaps things are just very slow?
> > 
> > I may ask you for some strace output of mpiexec later if the verbose
> > logging is not enlightening.
> > 
> > 		-- Pete
> 
> 
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec
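
For anyone reading this thread later: the "alarm(10)" behaviour Pete
describes, where each task gives up if initialization does not finish in
time, can be sketched roughly as below. This is not the code from hello.c
or from mpiexec. The names run_task and fake_init are hypothetical, the
sleep stands in for real MPI startup, and a plain fork/waitpid stands in
for however mpiexec actually learns that a task died. It only shows how an
unhandled SIGALRM ends a slow task and how a parent can see which signal
did it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for slow initialization: just sleeps. */
static void fake_init(unsigned delay_seconds)
{
    struct timespec ts;
    ts.tv_sec = (time_t)delay_seconds;
    ts.tv_nsec = 0;
    nanosleep(&ts, NULL);
}

/*
 * Fork one "task" that arms a watchdog with alarm() before initializing.
 * Returns 0 if the task finished in time, the signal number if the task
 * was killed by a signal, or -1 on fork/waitpid failure.
 */
int run_task(unsigned timeout_seconds, unsigned init_seconds)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: no SIGALRM handler is installed, so the default
         * action (terminate the process) applies when the timer fires. */
        alarm(timeout_seconds);
        fake_init(init_seconds);
        alarm(0);               /* initialization done, cancel watchdog */
        _exit(0);
    }

    /* Parent: wait for the task and report how it ended. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

With a 10 second timeout, run_task(10, 0) should return 0, while a task
whose initialization takes longer than the timeout comes back as SIGALRM.
The "signal 9" in the warnings above is a different case: that is the
kill -9 sent by mpiexec to the remaining tasks, not the watchdog itself.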




