Failure to start mpich/p4 on SMPs

Ben Webb ben at bellatrix.pcl.ox.ac.uk
Thu Jan 31 13:18:26 EST 2002


	There appears to be a bug in the interaction between mpich/p4 and 
mpiexec when running a job that is spread across several SMPs. For 
example, on our cluster of dual-processor Linux boxes, a job submitted 
with "-l nodes=cluster1:ppn=2+cluster2:ppn=2" will always fail with a 
message similar to the following:

bm_list_27535:  p4_error: lookup_slave_index_by_pid: %d not found: 27533
p0_27533: (0.098094) net_send: could not write to fd=4, errno = 32

"-l nodes=cluster1:ppn=1+cluster2:ppn=2" also fails, but
"-l nodes=cluster1:ppn=2+cluster2:ppn=1" works quite happily. In fact, 
failure seems to occur whenever a slave node has two processes attempting 
to run, so I can only assume this is some kind of race condition 
manifesting itself (although I thought this was what the new listener 
code was supposed to remedy). Has anybody else observed this behaviour?
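
The pattern in the three node specifications above can be written down as a small sketch. It is only a restatement of the observations, not a diagnosis: the helper names parse_nodes and expected_to_fail are hypothetical, and it assumes the first host in the list is the master, every later host is a slave, and a host given without ppn runs one process.

```python
def parse_nodes(spec):
    """Split a PBS "-l nodes=" value into (host, ppn) pairs."""
    pairs = []
    for chunk in spec.split("+"):
        host, _, props = chunk.partition(":")
        ppn = 1  # assumption: one process per node when ppn is omitted
        for prop in props.split(":"):
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        pairs.append((host, ppn))
    return pairs


def expected_to_fail(spec):
    """Hypothesis only: the job fails when any slave node (assumed to be
    every host after the first) is asked to run two or more processes."""
    slaves = parse_nodes(spec)[1:]
    return any(ppn >= 2 for _, ppn in slaves)


if __name__ == "__main__":
    for spec in ("cluster1:ppn=2+cluster2:ppn=2",
                 "cluster1:ppn=1+cluster2:ppn=2",
                 "cluster1:ppn=2+cluster2:ppn=1"):
        print(spec, "->", "fails" if expected_to_fail(spec) else "works")
```

The rule agrees with all three cases reported here, but three data points do not rule out other explanations.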

	I'm using mpiexec-0.65 (--with-smp-size=2) and mpich-1.2.3 
(--with-device=ch_p4 --with-comm=shared) under RedHat 7.2. "mpirun" with 
the PBS node file (via serv_p4) runs with no problems whatsoever. 
mpiexec with --comm=none also works OK. I am also unable to reproduce this 
behaviour with the MPICH test programs - it only occurs with the parallel 
version of CHARMM, built against mpich-1.2.3 - which is what leads me 
to suspect a race condition.

Ideas?

	Ben
-- 
ben at bellatrix.pcl.ox.ac.uk           http://bellatrix.pcl.ox.ac.uk/~ben/
"Verbosity leads to unclear, inarticulate things."
	- Vice President Dan Quayle, 11/30/88




