larger jobs fail regularly

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Tue May 8 11:38:37 EDT 2007


Dear All,

Starting larger jobs (128 CPUs on 32 nodes) fails quite often on our
new system with messages which seem to indicate a problem in the
communication of mpiexec with torque.
We are running SuSE SLES9 (x86_64), torque-2.1.6/2.1.8 and use mpiexec-0.82.

If I do something like
  #!/bin/sh
  #PBS -l nodes=32:ppn=4
  mpiexec -comm pmi [-v -v -kill -nostdout -nostdin] ./a.out -some args
the job sometimes runs perfectly fine, but in most cases
(running on the very same nodes) it fails with messages like
(much shortened):

mpiexec: Warning: poll_or_block_event: evt 122 remote system error.
mpiexec: Error: handle_pmi: response cmd=barrier_out: Bad file descriptor.
[cli_88]: [cli_41]: PMIU_parse_keyvals: unexpected key delimiter at character 1 in 0
[cli_41]: expecting cmd=barrier_out, got 0
[cli_41]: PMIU_parse_keyvals: unexpected key delimiter at character 1 in 0
[cli_41]: expecting cmd=barrier_out, got 0
[cli_41]: got unexpe[cli_29]: PMIU_parse_keyvals: unexpected key delimiter at character 1 in 0
_88]: got unexpected respocted response to get :cmd=get kvsname=31945.wadm1.rrze.uni-erlangen.de-spawn-0 key=r2h0
:
[cli_41]: PMIU_parse_keyvals: unexpected key delimiter at character 5 in @223ó225*
[cli_41]: expecting cmd=appnum, got @223ó225*
cted response to get :cmd=get kvsname=31945.wadm1.rrze.uni-erlangen.de-spawn-0 key=r2h0
_21]: PMIU_parse_keyvals: unexpected key delimiter at character 5 in @223ó225*
[cli_21]: expecting cmd=appnum, got @223ó225*
r; fd=6 buf=:cmd=get kvsname=31945.wadm1.rrze.uni-erlangen.de-spawn-0 key=r2h0
:
system msg for write_line failure : Broken pipe
[cli_2]: got unexpected response to get :cmd=get kvsname=31945.wadm1.rrze.uni-erlangen.de-spawn-0 key=r2h0
:
[cli_2]: write_lincted response to get :cmd=get kvsname=31945.wadm1.rrze.uni-erlangen.de-spawn-0 key=r2h0
:
[cli_33]: PMIU_parse_keyvals: unexpected key delimiter at character 5 in @223ó225*

mpiexec: process_start_event: evt 13 task 11 on w0231.
mpiexec: Warning: poll_or_block_event: evt 126 remote system error.
mpiexec: process_start_event: evt 126 task 124 on w0202.
mpiexec: process_start_event: task 124 on w0202 too fast, no obit.
mpiexec: kill_tasks: killing all tasks.
mpiexec: kill_tasks: kill my task 0 on w0233.
mpiexec: kill_tasks: tried to kill my already dead task 47.


I'm not sure which part of the vast information produced by mpiexec
-v or logged by the pbs_moms is really important. Moreover, the
problem does not always occur, and it only occurs for larger jobs,
which makes it harder to debug. For example, using only 3 of the 4 CPUs
per node never triggered the abort.
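
To cut the verbose output down to something readable, one option is to
keep only mpiexec's own messages and tally which nodes appear in the
process_start_event lines. The following is only a rough sketch: it
assumes the combined job output was saved to a file, and the function
names are made up for illustration.

```shell
#!/bin/sh
# Sketch: triage a saved "mpiexec -v" log.
# Function names are hypothetical; the log file name is passed as $1.

# Print only mpiexec's own messages. Lines from the MPI processes
# start with [cli_N] and are often interleaved, so they are dropped.
mpiexec_lines() {
    grep '^mpiexec:' "$1"
}

# Count how often each node appears in process_start_event lines such as
#   mpiexec: process_start_event: evt 13 task 11 on w0231.
# Output is "count node", most frequent node first.
nodes_with_events() {
    mpiexec_lines "$1" |
        sed -n 's/^mpiexec: process_start_event: .* on \([A-Za-z0-9_-]*\).*$/\1/p' |
        sort | uniq -c | sort -rn
}
```

After sourcing these definitions, calling nodes_with_events with the
saved log (for example a hypothetical job.log) might show whether the
events cluster on a few nodes or are spread over all of them.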

Any hints on how to tackle the problem?

Regards,

Thomas Zeiser

