larger jobs fail regularly
Troy Baer
troy at osc.edu
Tue May 8 11:44:37 EDT 2007
On Tue, 2007-05-08 at 17:38 +0200, Thomas Zeiser wrote:
> starting larger jobs (128 CPUs on 32 nodes) fails quite often on our
> new system with messages which seem to indicate a problem in the
> communication of mpiexec with torque.
> We are running SuSE SLES9 (x86_64), torque-2.1.6/2.1.8 and use mpiexec-0.82.
The error output you posted shows several PMI errors, so I suspect
there may be a bad interaction between mpiexec and your MPI library.
What MPI implementation are you using?
--Troy
--
Troy Baer troy at osc.edu
Science & Technology Support http://www.osc.edu/hpc/
Ohio Supercomputer Center 614-292-9701