mpiexec and # of processors

Pete Wyckoff pw at osc.edu
Fri Mar 31 10:44:00 EST 2006


bfp at purdue.edu wrote on Fri, 31 Mar 2006 09:31 -0500:
> I have a quick question regarding mpiexec-8.0 which we're using to run 
> some benchmarks (HPL).

We'll all be boring youngsters with stories about our pioneering
work on antique "clusters" when that version gets released.

> We've found that mpiexec works fine until we get to about 512+ nodes, and 
> then the parallel job fails for various reasons. Is there an adjustable 
> parameter in the mpiexec code that limits parallel jobs to 512 processors, 
> or do you think the problem is likely not mpiexec related?

The latest 0.80 release happened before a bunch of scalability
changes went in.  Debugging on Sandia's 4000-ish node (dual) cluster
led to a pretty major switch to asynchronous startup in the
InfiniBand code.  Problems arose around 2000 tasks.  If you're
using IB, please try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz .

Other sources of scalability problems were:  poor mvapich startup
behavior, unfixable; unnecessary fsync() in torque at each task
startup, fixed with a configure option in later releases.

If you can offer a nice example of how it fails, maybe we'll be able
to pick out what particular problem you're having.  Run it with
"-v -v -v", plus "-nostdin -nostdout" to take IO out of the equation.

		-- Pete

