mpiexec and # of processors
Pete Wyckoff
pw at osc.edu
Fri Mar 31 10:44:00 EST 2006
bfp at purdue.edu wrote on Fri, 31 Mar 2006 09:31 -0500:
> I have a quick question regarding mpiexec-8.0 which we're using to run
> some benchmarks (HPL).

We'll all be boring youngsters with stories about our pioneering
work on antique "clusters" when that version gets released.

> We've found that mpiexec works fine until we get to about 512+ nodes, and
> then the parallel job fails for various reasons. Is there an adjustable
> parameter in the mpiexec code that limits parallel jobs to 512 processors,
> or do you think the problem is likely not mpiexec related?

The latest 0.80 release happened before a bunch of scalability
changes went in. Debugging on Sandia's 4000-ish node (dual) cluster
led to a pretty major switch to asynchronous startup in the
InfiniBand code. Problems arose around 2000 tasks. If you're
using IB, please try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz .

Other sources of scalability problems were: poor mvapich startup
behavior, unfixable; unnecessary fsync() in torque at each task
startup, fixed with a configure option in later releases.

If you can offer a nice example of how it fails, maybe we'll be able
to pick out what particular problem you're having. Run it with
"-v -v -v", and with "-nostdin -nostdout" to take IO out of the equation.
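
As a rough sketch of such a debugging run from inside a batch job
script, something like the following could work. The program name
./xhpl and the log file name mpiexec-debug.log are placeholders
assumed for illustration, not details from the report; only the
flags come from the suggestion above.

```shell
#!/bin/sh
# Sketch only: assemble the debugging command line suggested above.
# Assumed placeholders: ./xhpl (program) and mpiexec-debug.log (log file).

# Print the mpiexec command line for the program given as $1.
mpiexec_debug_cmd() {
    # "-v -v -v" and "-nostdin -nostdout" are the flags quoted in the reply.
    echo "mpiexec -v -v -v -nostdin -nostdout $1"
}

# Show the command. Inside a real job one would run it instead, e.g.:
#   mpiexec -v -v -v -nostdin -nostdout ./xhpl > mpiexec-debug.log 2>&1
mpiexec_debug_cmd ./xhpl
```

Capturing mpiexec's own output in one file keeps the verbose startup
messages together, which makes them easier to post to the list.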
-- Pete