Failure to start mpich/p4 on SMPs

Ben Webb ben at bellatrix.pcl.ox.ac.uk
Fri Feb 1 13:58:21 EST 2002


(cc'd to mpibugs, as this appears to be a problem with MPICH itself, rather
than with mpiexec specifically)

On Thu, 31 Jan 2002, Ben Webb wrote:

> 	There appears to be a bug in the interaction between mpich/p4 and 
> mpiexec when running a job that is spread across several SMPs. For 
> example, on our cluster of dual-processor Linux boxes, a job submitted 
> with "-l nodes=cluster1:ppn=2+cluster2:ppn=2" will always fail with a 
> message similar to the following:
> 
> bm_list_27535:  p4_error: lookup_slave_index_by_pid: %d not found: 27533
> p0_27533: (0.098094) net_send: could not write to fd=4, errno = 32

	I've looked into this problem further, and it looks like a bug in 
the setup of the listener process on the master node; the pid of the very 
first process is not propagated to the spawned listener. mpich/p4 then 
gets very upset at the first CONNECTION_REQUEST packet sent to that first 
process. Quite why this bug only manifests itself for me with CHARMM and 
mpiexec, and not with the testcases and/or mpiexec, I don't know, but as 
far as I can tell the bug is present in all cases.

The attached patch appears to correct the problem for me.
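
For anyone who wants to see the failure mode in isolation, here is a toy 
sketch of it. This is not the p4 source: the struct, the toy_* names, and 
the assumption that the unrecorded pid slot holds 0 are all made up for 
illustration. Only the slave_pid/slave_fd field names and getpid() come 
from the patch. The idea is simply that a lookup by pid cannot find slot 0 
unless the first process's pid was stored there before the listener was 
forked.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_MAX_SLAVES 4

/* Hypothetical stand-in for the listener's table; NOT the real p4 struct. */
struct toy_listener_info {
    int   num_slaves;
    pid_t slave_pid[TOY_MAX_SLAVES];
    int   slave_fd[TOY_MAX_SLAVES];
};

/* Return the slot whose recorded pid matches, or -1 if none does. */
static int toy_lookup_index_by_pid(const struct toy_listener_info *info,
                                   pid_t pid)
{
    for (int i = 0; i < info->num_slaves; i++) {
        if (info->slave_pid[i] == pid)
            return i;
    }
    return -1;
}

/* Slot 0 set up with the pipe fd only (assumption: the pid slot is 0). */
static void toy_init_unpatched(struct toy_listener_info *info, int fd)
{
    info->num_slaves = 1;
    info->slave_pid[0] = 0;          /* pid of the first process never stored */
    info->slave_fd[0] = fd;
}

/* Same setup, plus the one-line fix: record our own pid in slot 0. */
static void toy_init_patched(struct toy_listener_info *info, int fd)
{
    toy_init_unpatched(info, fd);
    info->slave_pid[0] = getpid();
}

/* A request naming our pid misses before the fix and hits slot 0 after it. */
static void toy_self_check(void)
{
    struct toy_listener_info before, after;

    toy_init_unpatched(&before, 4);
    toy_init_patched(&after, 4);
    assert(toy_lookup_index_by_pid(&before, getpid()) == -1);
    assert(toy_lookup_index_by_pid(&after, getpid()) == 0);
}
```

Calling toy_self_check() from a main() exercises both cases; the -1 
result corresponds to the "not found" error quoted above.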

P.S. The MPICH "Unsolved problems" page ("Shared Memory in Linux") is 
possibly misleading. If a job crashes, any SysV shared memory segments or 
semaphores are not released. Once all the slots are used up, subsequent 
jobs will then fail with "semget failed for setnum = 0", in which case the 
solution is to use ipcrm to release those resources. (This is for ch_p4 
with -comm=shared, on a RedHat 7.2 i686 box.)
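
To illustrate why the resources linger: SysV IPC objects are owned by the 
kernel rather than by the process, so they survive until something removes 
them explicitly. The sketch below is not MPICH code and the toy_* names are 
hypothetical; it only shows the standard semget()/semctl() calls, with 
IPC_RMID being the removal step that a crashed job never reaches.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Create a private set holding one semaphore.
 * Returns the set id, or -1 on error (for example, no free slots). */
static int toy_sem_create(void)
{
    return semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
}

/* Remove the set from the system. Returns 0 on success, -1 on error.
 * If the owning process dies before calling this, the set stays
 * allocated until it is removed some other way. */
static int toy_sem_remove(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

A process that calls toy_sem_create() and then exits without calling 
toy_sem_remove() leaves the set visible in the output of ipcs, which is 
the situation that has to be cleaned up by hand.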

	Ben
-- 
ben at bellatrix.pcl.ox.ac.uk           http://bellatrix.pcl.ox.ac.uk/~ben/
"Jeeves shimmered out and came back with a telegram."
	- 'Jeeves Takes Charge', P. G. Wodehouse

-------------- next part --------------
diff -Nur mpich-1.2.3/mpid/ch_p4/p4/lib/p4_bm.c mpich-1.2.3-patched/mpid/ch_p4/p4/lib/p4_bm.c
--- mpich-1.2.3/mpid/ch_p4/p4/lib/p4_bm.c	Thu Jan 17 14:27:23 2002
+++ mpich-1.2.3-patched/mpid/ch_p4/p4/lib/p4_bm.c	Fri Feb  1 17:45:30 2002
@@ -507,6 +507,7 @@
 	get_pipe(&end_1, &end_2);
 	p4_local->listener_fd = end_1;
 	listener_info->slave_fd[0] = end_2;
+	listener_info->slave_pid[0] = getpid();
 
 	listener_pid = fork_p4();
 	if (listener_pid < 0)

