Failure to start mpich/p4 on SMPs
Ben Webb
ben at bellatrix.pcl.ox.ac.uk
Thu Jan 31 13:18:26 EST 2002
There appears to be a bug in the interaction between mpich/p4 and
mpiexec when running a job that is spread across several SMPs. For
example, on our cluster of dual-processor Linux boxes, a job submitted
with "-l nodes=cluster1:ppn=2+cluster2:ppn=2" will always fail with a
message similar to the following:
bm_list_27535: p4_error: lookup_slave_index_by_pid: %d not found: 27533
p0_27533: (0.098094) net_send: could not write to fd=4, errno = 32
"-l nodes=cluster1:ppn=1+cluster2:ppn=2" also fails, but
"-l nodes=cluster1:ppn=2+cluster2:ppn=1" works quite happily. In fact,
failure seems to occur whenever a slave node has two processes attempting
to run, so I can only assume this is some kind of race condition
manifesting itself (although I thought this was what the new listener
code was supposed to remedy). Has anybody else observed this behaviour?
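The pattern in the three node specifications above can be written down as a small sketch. It assumes that the first node in the "-l nodes=" list hosts the master process and that every later node is a slave; that is an inference from the observations, not something confirmed by the mpiexec or mpich sources. The helper names parse_nodes and expected_to_fail are hypothetical and exist only for this illustration.

```python
def parse_nodes(spec):
    """Parse a PBS-style node spec such as 'cluster1:ppn=2+cluster2:ppn=1'
    into a list of (hostname, ppn) tuples. ppn defaults to 1 if absent."""
    nodes = []
    for part in spec.split("+"):
        fields = part.split(":")
        host = fields[0]
        ppn = 1
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        nodes.append((host, ppn))
    return nodes


def expected_to_fail(spec):
    """Apply the observed rule: the job fails whenever any slave node
    (ASSUMPTION: every node after the first) runs two or more processes."""
    nodes = parse_nodes(spec)
    return any(ppn >= 2 for _host, ppn in nodes[1:])


# The three specifications reported in this message.
for spec in (
    "cluster1:ppn=2+cluster2:ppn=2",
    "cluster1:ppn=1+cluster2:ppn=2",
    "cluster1:ppn=2+cluster2:ppn=1",
):
    print(spec, "-> expected to fail:", expected_to_fail(spec))
```

This only encodes the observed symptom. It says nothing about the underlying cause, which may or may not be a race condition.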
I'm using mpiexec-0.65 (--with-smp-size=2) and mpich-1.2.3
(--with-device=ch_p4 --with-comm=shared) under RedHat 7.2. "mpirun" with
the PBS node file (via serv_p4) runs with no problems whatsoever.
mpiexec with --comm=none also works OK. I am also unable to reproduce this
behaviour with the MPICH test programs - it only occurs with the parallel
version of CHARMM, built against mpich-1.2.3 - which is what leads me
to suspect a race condition.
Ideas?
Ben
--
ben at bellatrix.pcl.ox.ac.uk http://bellatrix.pcl.ox.ac.uk/~ben/
"Verbosity leads to unclear, inarticulate things."
- Vice President Dan Quayle, 11/30/88