larger jobs fail regularly

Troy Baer troy at osc.edu
Tue May 8 11:44:37 EDT 2007


On Tue, 2007-05-08 at 17:38 +0200, Thomas Zeiser wrote:
> starting larger jobs (128 CPUs on 32 nodes) fails quite often on our
> new system with messages which seem to indicate a problem in the 
> communication of mpiexec with torque.
> We are running SuSE SLES9 (x86_64), torque-2.1.6/2.1.8 and use mpiexec-0.82.

The output in your message shows several PMI errors, so I suspect
there may be a bad interaction between mpiexec and your MPI library.
What MPI implementation are you using?

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701



