mpiexec and # of processors

Bryan Putnam bfp at purdue.edu
Fri Mar 31 11:00:33 EST 2006


On Fri, 31 Mar 2006, Pete Wyckoff wrote:

> bfp at purdue.edu wrote on Fri, 31 Mar 2006 09:31 -0500:
> > I have a quick question regarding mpiexec-8.0 which we're using to run 
> > some benchmarks (HPL).
> 
> We'll all be boring youngsters with stories about our pioneering
> work on antique "clusters" when that version gets released.

Oh yes, version 0.80 is what we're using. :-) I'll get back to you with some 
more specific error information. No, we're not using InfiniBand.

Thanks!
Bryan

> 
> > We've found that mpiexec works fine until we get to about 512+ nodes, and 
> > then the parallel job fails for various reasons. Is there an adjustable 
> > parameter in the mpiexec code that limits parallel jobs to 512 processors, 
> > or do you think the problem is likely not mpiexec related?
> 
> The latest 0.80 release happened before a bunch of scalability
> changes went in.  Debugging on Sandia's 4000-ish node (dual) cluster
> led to a pretty major switch to asynchronous startup in the
> InfiniBand code.  Problems arose around 2000 tasks.  If you're
> using IB, please try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz .
> 
> Other sources of scalability problems were:  poor mvapich startup
> behavior, unfixable; unnecessary fsync() in torque at each task
> startup, fixed with a configure option in later releases.
> 
> If you can offer a nice example of how it fails, maybe we'll be able
> to pick out what particular problem you're having.  Run it with
> "-v -v -v", plus "-nostdin -nostdout" to take IO out of the equation.
> 
> 		-- Pete
> 
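For reference, the debugging run suggested above could be assembled roughly as follows. This is only a sketch: the flags are the ones named in the quoted message, while the helper name build_debug_command and the binary name "./xhpl" are assumptions made for illustration, not anything from mpiexec itself.

```python
import shlex


def build_debug_command(program, program_args=()):
    """Return an argv list for a verbose mpiexec run with stdin/stdout
    forwarding turned off, as suggested in the quoted message."""
    cmd = ["mpiexec", "-v", "-v", "-v", "-nostdin", "-nostdout", program]
    cmd.extend(program_args)
    return cmd


if __name__ == "__main__":
    # "./xhpl" is an assumed name for the HPL benchmark binary.
    print(shlex.join(build_debug_command("./xhpl")))
```

The printed line can be pasted into the batch script in place of the usual mpiexec call; the verbose output from the failing run is what would be worth posting back to the list.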



