mpiexec and # of processors
Bryan Putnam
bfp at purdue.edu
Fri Mar 31 11:00:33 EST 2006
On Fri, 31 Mar 2006, Pete Wyckoff wrote:
> bfp at purdue.edu wrote on Fri, 31 Mar 2006 09:31 -0500:
> > I have a quick question regarding mpiexec-8.0 which we're using to run
> > some benchmarks (HPL).
>
> We'll all be boring youngsters with stories about our pioneering
> work on antique "clusters" when that version gets released.
Oh yes, version 0.80 is what we're using. :-) I'll get back to you with some
more specific error information. No, we're not using infiniband.
Thanks!
Bryan
>
> > We've found that mpiexec works fine until we get to about 512+ nodes, and
> > then the parallel job fails for various reasons. Is there an adjustable
> > parameter in the mpiexec code that limits parallel jobs to 512 processors,
> > or do you think the problem is likely not mpiexec related?
>
> The latest 0.80 release happened before a bunch of scalability
> changes went in. Debugging on Sandia's 4000-ish node (dual) cluster
> led to a pretty major switch to asynchronous startup in the
> InfiniBand code. Problems arose around 2000 tasks. If you're
> using IB, please try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz .
>
> Other sources of scalability problems were: poor mvapich startup
> behavior, unfixable; unnecessary fsync() in torque at each task
> startup, fixed with a configure option in later releases.
>
> If you can offer a nice example of how it fails, maybe we'll be able
> to pick out what particular problem you're having. With "-v -v -v",
> and "-nostdin -nostdout" to take IO out of the equation.
>
> -- Pete
>
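For anyone reproducing this, here is a minimal sketch of the debug invocation suggested above. It only assembles the command line from the flags named in the message ("-v -v -v", "-nostdin", "-nostdout"); the helper name build_debug_command and the binary path ./xhpl are placeholders for illustration, not part of mpiexec.

```python
import shlex


def build_debug_command(program, program_args=()):
    """Return an mpiexec argument list for a verbose run without stdio.

    Hypothetical helper: only the flags quoted in the message above are
    used; no other mpiexec options are assumed.
    """
    # "-v" three times, as suggested in the message, for verbose output.
    # "-nostdin" and "-nostdout" take IO out of the equation.
    cmd = ["mpiexec", "-v", "-v", "-v", "-nostdin", "-nostdout"]
    cmd.append(program)
    cmd.extend(program_args)
    return cmd


if __name__ == "__main__":
    # "./xhpl" is a placeholder for the HPL binary; adjust to your build.
    print(shlex.join(build_debug_command("./xhpl")))
```

The printed line can be pasted into the batch job script, with the verbose output captured and sent to the list.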