LAM/MPI support for Mpiexec (experimental)

Ben Webb ben at bellatrix.pcl.ox.ac.uk
Thu May 16 14:46:20 EDT 2002


	The attached patch to LAM 6.5.6 makes it work with Mpiexec on a
PBS cluster. I took this route because, although I think it would be
possible to have mpiexec do the lamboot / mpirun / lamhalt sequence
itself, you would lose the extra functionality that LAM's own tools give
you. (But I see no reason why this support could not be added to
mpiexec's -comm=lam in the future.)

	Basically, to use LAM at present with rsh/ssh/etc. you first run
"lamboot", which sets up a lamd daemon on each node. Then you use mpirun
to run your jobs, and this talks to the lamd daemons. When you're done,
you use "lamhalt" to shut down the lamd's. Both mpirun and lamhalt use the
network of lamd daemons to do their business, so do not need rsh/ssh or
Mpiexec; it's only lamboot that requires modification.

Currently, lamboot does the following:-

- Set up a listening TCP socket on node 0
- rsh to each node:
  - run "hboot" on it, telling it the hostname of node 0, and the
    listening port number
  - hboot in turn spawns the "lamd" daemon, which sets up another TCP
    listening socket, and connects back to node 0
  - lamboot then accepts the connection from lamd, and receives the
    port number of lamd's listening socket
- Once all nodes have been contacted, lamboot's TCP socket is closed
- lamboot contacts each lamd via the port received earlier, and tells
  each one the numbers of the listening ports on every other node
- lamboot's job is now done; the lamd daemons now have full
  connectivity, and take over from here.
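
The first step of that sequence, a listening socket on a port chosen by the kernel, can be sketched roughly as follows. This is only an illustration of the idea behind the sfh_sock_open_srv_inet_stm(&agent_port) call with agent_port set to 0 that appears in the patch; open_listener is a hypothetical name and this is not LAM's actual implementation.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP listening socket on a kernel-chosen
 * port. Returns the socket descriptor and stores the port number in
 * *port, or returns -1 on error. */
int open_listener(int *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int sd = socket(AF_INET, SOCK_STREAM, 0);

    if (sd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;          /* 0 asks the kernel for any free port */

    if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sd, 8) < 0 ||
        getsockname(sd, (struct sockaddr *)&addr, &len) < 0) {
        close(sd);
        return -1;
    }

    /* getsockname() filled in the port that was actually assigned. */
    *port = ntohs(addr.sin_port);
    return sd;
}
```

The port number returned here is what would be handed to hboot on each node, so that lamd knows where to connect back.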

Essentially, my patch changes this behaviour to:-

- Set up N listening sockets, one for each node in the cluster
- Create an mpiexec configuration file with the necessary hboot commands
  for each node
- Fork and run mpiexec in the background, passing it the configuration file
- Accept connections from each lamd, and receive a port number from each
- Contact each lamd in the same way as before.

	I have hacked hboot and lamd such that they do not daemonise, so
the spawned mpiexec lasts for the duration of the job. Once lamboot is
completed, lamnodes, mpirun, etc. should work as per normal. When the
job completes, PBS will kill the mpiexec process and thus the spawned
lamd's, although you can do this the "proper" way, by running lamhalt,
which will kill every lamd and thus prompt mpiexec to exit.
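
Staying in the foreground, rather than daemonising, comes down to the parent waiting for its child instead of exiting straight away. A minimal sketch of that pattern follows; run_in_foreground is a hypothetical name, and the actual change in the patch is a wait(&status) call added to hboot plus disabling a block with #if 0.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] as a child process and block until
 * it exits. Returns the child's exit status, or -1 if fork() or
 * waitpid() fails or the child did not exit normally. */
int run_in_foreground(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* The parent stays alive for as long as the child runs, so whatever
     * started the parent keeps a live process to track and to kill. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```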

	"wipe" is very simple; it just runs "tkill" on each node to kill
off the lamd process. I don't think you should ever need to do this, as
just killing mpiexec should kill the spawned lamd's anyway. The patch
does include code to use mpiexec to run the tkill commands, but it won't
actually work in practice because MOM won't let mpiexec connect to TM
twice (and it'll already be connected once, for the lamboot call).

	I haven't touched recon, lamgrow, or lamshrink, so these won't
work. I don't think they'd be too difficult to fix though, if people
really really wanted them.

	I've hacked lamboot so that the default boot schema is the PBS
nodefile, so running a LAM/MPI job via PBS and Mpiexec should be as
simple as putting the following in a PBS script:-

lamboot
mpirun C /path/to/mpi/binary
lamhalt
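
The lookup order that makes a bare "lamboot" work can be sketched as below: an explicit file name on the command line wins, then PBS_NODEFILE, then the LAMBHOST and TROLLIUSBHOST variables that lamboot already consulted. pick_boot_schema is a hypothetical name; it returns NULL where the real code falls through to its own default branch, which the patch does not show.

```c
#include <stdlib.h>

/* Hypothetical helper mirroring the if/else chain in the lamboot.c hunk.
 * Returns the boot schema file name, or NULL if none was given and none
 * of the environment variables is set. */
const char *pick_boot_schema(int argc, char *const argv[])
{
    const char *fname;

    if (argc == 2)
        return argv[1];                 /* explicit file on command line */
    if ((fname = getenv("PBS_NODEFILE")) != NULL)
        return fname;                   /* the check added by the patch */
    if ((fname = getenv("LAMBHOST")) != NULL)
        return fname;
    if ((fname = getenv("TROLLIUSBHOST")) != NULL)
        return fname;

    return NULL;                        /* caller applies its own default */
}
```

Because PBS_NODEFILE is checked before LAMBHOST, a job running under PBS picks up the nodes PBS allocated even if LAMBHOST is set in the user's environment.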

CAVEATS:

- This patch is not 100% perfect yet. Obviously.
- Processes started via LAM's mpirun don't record their CPU usage, etc.
  with PBS, and neither do they get killed if you "qdel" the PBS job.
  I'm not entirely sure why this is, as the processes are children of
  the TM-spawned lamd process. I think lamd must be calling setsid()
  somewhere. I will investigate further.
- My error handling isn't very robust, so if mpiexec isn't installed, or
  you feed it garbage, bad things will happen.
- Mpiexec reports I/O errors after startup from lamboot. I think this is
  because it inherits lamboot's stdin etc., and am pretty sure that just
  closing these descriptors will solve this problem.
- It'd be nicer if Mpiexec could read a configuration file from stdin,
  so that I didn't have to mess around with temporary files. How doable
  is this?
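
On the setsid() suspicion in the second caveat: a process that calls setsid() moves into a new session of its own, which would be consistent with PBS losing track of it, given the comment in hboot.c that TM relies on everything being in one session. The small sketch below only demonstrates that effect; child_leaves_session is a hypothetical name and nothing here is taken from lamd.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demonstration: fork a child that calls setsid() and
 * report whether it ended up outside the caller's session.
 * Returns 1 if the child left the session, 0 if not, -1 on error. */
int child_leaves_session(void)
{
    int status;
    pid_t parent_sid = getsid(0);
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* A freshly forked child is never a process group leader, so
         * setsid() is allowed to create a new session here. */
        if (setsid() == (pid_t)-1)
            _exit(2);
        _exit(getsid(0) != parent_sid ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:  return 1;          /* child is in a different session */
    case 1:  return 0;          /* child stayed in the same session */
    default: return -1;         /* setsid() itself failed */
    }
}
```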

Any suggestions for improvements to this patch, or comments, welcomed...

	Ben
-- 
ben at bellatrix.pcl.ox.ac.uk           http://bellatrix.pcl.ox.ac.uk/~ben/
"A low voter turnout is an indication of fewer people going to
the polls."
	- Vice President Dan Quayle
-------------- next part --------------
diff -Nur -Xlam.exclude lam-6.5.6/share/boot/lambootagent.c lam-6.5.6-patched/share/boot/lambootagent.c
--- lam-6.5.6/share/boot/lambootagent.c	Mon Nov 19 16:13:45 2001
+++ lam-6.5.6-patched/share/boot/lambootagent.c	Thu May 16 16:21:31 2002
@@ -73,8 +73,8 @@
 int
 lambootagent(struct lamnode *lamnet, int nlamnet, int *nboot, int *nrun)
 {
-	int		agent_port;	/* port number for replies */
-	int		agent_sd;	/* socket for replies */
+	int		agent_port[nlamnet];	/* port number for replies */
+	int		agent_sd[nlamnet];	/* socket for replies */
 	int		boot_sd;	/* connection to new node */
 	int		cmdc;		/* command vector count */
 	int		dlport;
@@ -84,7 +84,12 @@
 	int4		origin;		/* origin node ID */
 	char		**cmdv;		/* command vector */
 	char		*batchid;	/* batch job ID */
+	char		*mpiexec[10];	/* argv for mpiexec invocation */
+	char		tmpnam[80];
+	int		tmpfd;
+	FILE		*fp;
 	unsigned char	*p;
+	pid_t		childpid;
 
 	*nboot = 0;
 	*nrun = 0;
@@ -99,22 +104,43 @@
 	fl_verbose = opt_taken('v');
 	fl_fast = opt_taken('b');
 	fl_close = opt_taken('s');
+
 /*
- * Allocate a server socket and port.
+ * Write mpiexec config file.
  */
-	agent_port = 0;
-	agent_sd = sfh_sock_open_srv_inet_stm(&agent_port);
-	if (agent_sd < 0) {
-	  show_help("boot", "socket-fail", NULL);
-	  return(LAMERROR);
+	strcpy(tmpnam, "/tmp/lam-mpiexec.cfg-XXXXXX");
+	tmpfd = mkstemp(tmpnam);
+	if (tmpfd == -1) {
+		perror("Create temporary file failed");
+		exit(1);
+	}
+	fp = fdopen(tmpfd, "w");
+	if (!fp) {
+		perror("Open of temp file failed");
+		exit(1);
+	}
+	if (fl_verbose) {
+		printf("Using mpiexec config file %s\n", tmpnam);
 	}
+
 /*
- * Make the socket close on exec.
+ * Allocate server sockets and ports.
  */
-	if (fcntl(agent_sd, F_SETFD, 1) == -1) {
-	  show_help(NULL, "system-call-fail", "fcntl (set close-on-exec)", 
-		    NULL);
-	  return(LAMERROR);
+	for (i = 0; i < nlamnet; i++) {
+	  agent_port[i] = 0;
+	  agent_sd[i] = sfh_sock_open_srv_inet_stm(&agent_port[i]);
+	  if (agent_sd[i] < 0) {
+	    show_help("boot", "socket-fail", NULL);
+	    return(LAMERROR);
+	  }
+/*
+ * Make the sockets close on exec.
+ */
+	  if (fcntl(agent_sd[i], F_SETFD, 1) == -1) {
+	    show_help(NULL, "system-call-fail", "fcntl (set close-on-exec)", 
+		      NULL);
+	    return(LAMERROR);
+	  }
 	}
 /*
  * Find the local node.
@@ -160,18 +186,14 @@
 /*
  * Invoke hboot on the new host.
  */
-		cmdc = 0;
-		cmdv = 0;
-		argvadd(&cmdc, &cmdv, DEFTHBOOT);
-		argvadd(&cmdc, &cmdv, "-t");
-		argvadd(&cmdc, &cmdv, "-c");
-		argvadd(&cmdc, &cmdv, "lam-conf.lam");
+		fprintf(fp, "%s : %s -t -c lam-conf.lam", lamnet[i].lnd_hname,
+						          DEFTHBOOT);
 
 		if (fl_debug) {
-			argvadd(&cmdc, &cmdv, "-d");
+			fprintf(fp, " -d");
 		}
 		if (fl_verbose) {
-			argvadd(&cmdc, &cmdv, "-v");
+			fprintf(fp, " -v");
 		}
 /*
  * If remote node, close stdio of processes, unless forced by the
@@ -180,7 +202,7 @@
  * hboot/lamd on somenode to close their stdio so that rsh can finish.
  */
 		if (i != local || fl_close) {
-			argvadd(&cmdc, &cmdv, "-s");
+			fprintf(fp, " -s");
 		}
 /*
  * If this is under a batch system, pass the -b to both hboot and to
@@ -188,26 +210,22 @@
  */
 		batchid = get_batchid();
 		if (strlen(batchid) > 0) {
-		  argvadd(&cmdc, &cmdv, "-b");
-		  argvadd(&cmdc, &cmdv, batchid); 
+		  fprintf(fp, " -b %s", batchid);
 		}
 /*
  * Override the $inet_topo variable.
  */
 		p = (unsigned char *) &lamnet[local].lnd_addr.sin_addr;
-		argvadd(&cmdc, &cmdv, "-I");
-		sprintf(buf, "%c%s-H %u.%u.%u.%u -P %d -n %d -o %d %s %s%c",
-			i == local ? ' ' : '"',
+		fprintf(fp, " -I \" %s-H %u.%u.%u.%u -P %d -n %d -o %d %s %s\"",
 			opt_taken('x') ? "-x " : "",
 			(unsigned) p[0], (unsigned) p[1],
 			(unsigned) p[2], (unsigned) p[3],
-			agent_port,
+			agent_port[i],
 			i,
 			origin,
 			(strlen(batchid) == 0 ? " " : "-b"),
-			(strlen(batchid) == 0 ? " " : batchid),
-			i == local ? ' ' : '"');
-		argvadd(&cmdc, &cmdv, buf);
+			(strlen(batchid) == 0 ? " " : batchid));
+		fprintf(fp, "\n");
 
 		VERBOSE("Executing %s on n%d (%s - %d CPU%s)...\n", 
 			DEFTHBOOT, i, lamnet[i].lnd_hname,
@@ -215,48 +233,35 @@
 			(lamnet[i].lnd_ncpus > 1) ? "s" : "");
 
 		(*nboot)++;
+	}
 
-		if (i == local) {
-		        if (fl_debug) {
-			  int j;
-			  
-			  printf("lamboot: attempting to execute \"");
-			  for (j = 0; j < cmdc; j++) {
-			    if (j > 0)
-			      printf(" ");
-			    if (strchr(cmdv[j], ' ') != NULL)
-			      printf("\"%s\"", cmdv[j]);
-			    else
-			      printf("%s", cmdv[j]);
-			  }
-			  printf("\"\n");
-			}
-			r = _lam_few(cmdv);
-
-			if (r) {
-				(*nboot)--;
-				errno = r;
-				show_help("boot", "fork-fail", cmdv[0], NULL);
-				argvfree(cmdv);
-				return(LAMERROR);
-			}
-		} else {
-			r = inetexec(lamnet[i].lnd_hname, lamnet[i].lnd_uname,
-				     cmdv, (fl_debug ? "lamboot" : NULL),
-				     fl_fast);
-
-			if (r) {
-				(*nboot)--;
-				argvfree(cmdv);
-				/* inetexec will display errors if it
-                                   fails */
-				return(LAMERROR);
-			}
-		}
+/*
+ * Fire off mpiexec to start the hboot processes.
+ */
+	fclose(fp);
+	mpiexec[0] = "/usr/bin/mpiexec";
+	mpiexec[1] = "-comm=none";
+	mpiexec[2] = "-config";
+	mpiexec[3] = tmpnam;
+	mpiexec[4] = NULL;
+	childpid = fork();
+	if (childpid == -1) {
+		lamfail("lambootagent fork failed");
+	} else if (childpid == 0) {
+		execv(mpiexec[0], mpiexec);
+		lamfail("execv failed");
+	}
+
+	for (i = 0; i < nlamnet; ++i) {
+/*
+ * Skip nodes that are invalid or already booted.
+ */
+		if ((lamnet[i].lnd_nodeid == NOTNODEID) ||
+				!(lamnet[i].lnd_type & NT_BOOT)) continue;
 /*
  * Accept a connection from the new host.
  */
-		boot_sd = sfh_sock_accept_tmout(agent_sd, LAM_TO_BOOT);
+		boot_sd = sfh_sock_accept_tmout(agent_sd[i], LAM_TO_BOOT);
 		if (boot_sd < 0) return(LAMERROR);
 /*
  * Read the new host port numbers.
@@ -272,7 +277,17 @@
 		(*nrun)++;
 	}
 
-	if (close(agent_sd)) return(LAMERROR);
+	if (fl_verbose) {
+		printf("all nodes connected\n");
+	}
+/*
+ * mpiexec must have fired up by now, so we can remove the config file
+ */
+	unlink(tmpnam);
+
+	for (i = 0; i < nlamnet; ++i) {
+		if (close(agent_sd[i])) return(LAMERROR);
+	}
 
 	if (fl_verbose) {
 		nodespin_init("topology");
diff -Nur -Xlam.exclude lam-6.5.6/tools/hboot/hboot.c lam-6.5.6-patched/tools/hboot/hboot.c
--- lam-6.5.6/tools/hboot/hboot.c	Mon Nov 19 16:14:48 2001
+++ lam-6.5.6-patched/tools/hboot/hboot.c	Thu May 16 16:15:52 2002
@@ -99,6 +99,8 @@
 	char		buf[32];	/* formatting buffer */
 	char		*full;		/* full pathname */
 	char		*tail;		/* tail of full pathname */
+	char **pt;
+	int		status;
 
 	/* Ensure that we are not root */
 
@@ -245,7 +247,7 @@
 	  exit(errno);
 	}
 
-#if 1
+#if 0
 	/* Comment this out to make the TM extensions to PBS work
            nicely -- everything will be in one session, so TM can kill
            it when it dies. */
@@ -304,6 +306,7 @@
 			if (fl_debug) {
 			  printf("hboot: attempting to execute \n");
 			}
+
 			execvp(p->psc_argv[0], p->psc_argv);
 			exit(errno);
 		}
@@ -323,6 +326,7 @@
 
 				printf("\n");
 			}
+			wait(&status);
 		}
 
 		if (p->psc_delay > 0) {
diff -Nur -Xlam.exclude lam-6.5.6/tools/lamboot/lamboot.c lam-6.5.6-patched/tools/lamboot/lamboot.c
--- lam-6.5.6/tools/lamboot/lamboot.c	Mon Nov 19 16:14:49 2001
+++ lam-6.5.6-patched/tools/lamboot/lamboot.c	Thu May 16 13:13:25 2002
@@ -271,6 +271,7 @@
  */
 	if (cmdc == 2) {
 		fname = cmdv[1];
+	} else if ((fname = getenv("PBS_NODEFILE"))) {
 	} else if ((fname = getenv("LAMBHOST"))) {
 	} else if ((fname = getenv("TROLLIUSBHOST"))) {
 	} else {
diff -Nur -Xlam.exclude lam-6.5.6/tools/wipe/wipe.c lam-6.5.6-patched/tools/wipe/wipe.c
--- lam-6.5.6/tools/wipe/wipe.c	Mon Nov 19 16:14:50 2001
+++ lam-6.5.6-patched/tools/wipe/wipe.c	Thu May 16 15:54:46 2002
@@ -86,6 +86,8 @@
 	int		badhost;	/* bad host index */
 	int		r, j, success = 1;
 	struct lamnode	*lamnet;	/* network description array */
+	char		tmpnam[80];
+	int		tmpfd;
 
 	/* Ensure that we are not root */
 
@@ -192,15 +194,23 @@
 	} else {
 	  DBUG("wipe: killing LAM from a non-member machine\n");
 	}
+
 /*
- * Build the tkill command.
+ * Write mpiexec config file.
  */
-	cmdn = 0;
-	cmdv = 0;
-	argvadd(&cmdn, &cmdv, DEFTTKILL);
-
-	if (fl_debug) {
-		argvadd(&cmdn, &cmdv, "-d");
+	strcpy(tmpnam, "/tmp/lam-mpiexec.cfg-XXXXXX");
+	tmpfd = mkstemp(tmpnam);
+	if (tmpfd == -1) {
+		perror("Create temporary file failed");
+		exit(1);
+	}
+	fp = fdopen(tmpfd, "w");
+	if (!fp) {
+		perror("Open of temp file failed");
+		exit(1);
+	}
+	if (fl_verbose) {
+		printf("Using mpiexec config file %s\n", tmpnam);
 	}
 
 	if (opt_taken('n')) {
@@ -208,71 +218,46 @@
 	} else {
 		limit = -1;
 	}
+
+	for (i = 0; (i < nlamnet) && limit; ++i) {
+		if (limit > 0) --limit;
+		fprintf(fp, "%s", lamnet[i].lnd_hname);
+	}
+	fprintf(fp, " : %s", DEFTTKILL);
+	if (fl_debug) {
+		fprintf(fp, " -d");
+	}
+
 /*
  * If we're running ounder a batch system, we have to propogate the
  * socket name to all the remote tkill instances.
  */
 	batchid = get_batchid();
 	if (strlen(batchid) > 0) {
-	  argvadd(&cmdn, &cmdv, "-b");
-	  argvadd(&cmdn, &cmdv, batchid);
+	  fprintf(fp, " -b %s", batchid);
 	}
+
 /*
- * Loop over all host nodes.
+ * Build the mpiexec command.
  */
-	global_ret = 0;
-
-	for (i = 0; (i < nlamnet) && limit; ++i) {
-
-		if (limit > 0) --limit;
-
-		VERBOSE("Executing %s on n%d (%s)...\n", DEFTTKILL,
-				lamnet[i].lnd_nodeid, lamnet[i].lnd_hname);
-
-                if (fl_debug) {
-		  printf("wipe: attempting to launch \"");
-		  for (j = 0; j < cmdn; j++) {
-		    if (j > 0)
-		      printf(" ");
-		    printf("%s", cmdv[j]);
-		  }
-		  printf("\" ");
-		}
-
-		if (lamnet[i].lnd_type & NT_ORIGIN) {
-		        DBUG("(local execution)\n");
-			r = _lam_few(cmdv);
-
-			if (r) {
-				errno = r;
-			}
-		} else {
-		        DBUG("(remote execution)\n");
-			r = inetexec(lamnet[i].lnd_hname,
-				     lamnet[i].lnd_uname, cmdv, 
-				     (fl_debug ? "wipe" : NULL),
-				     fl_fast);
-		}
-
-		if (r) {
-			fprintf(stderr, "wipe: %s failed on n%d (%s)\n",
-					DEFTTKILL, lamnet[i].lnd_nodeid,
-					lamnet[i].lnd_hname);
-
-			if (errno != EUNKNOWN) {
-				terror("wipe");
-			} else
-			  show_help(NULL, "unknown", NULL);
-
-			global_ret = errno;
-			success = 0;
-		}
-	}
-
-	if (success) {
-	  DBUG("wipe completed successfully\n");
+	cmdn = 0;
+	cmdv = 0;
+	argvadd(&cmdn, &cmdv, "/usr/bin/mpiexec");
+	argvadd(&cmdn, &cmdv, "-comm=none");
+	argvadd(&cmdn, &cmdv, "-config");
+	argvadd(&cmdn, &cmdv, tmpnam);
+
+	r = _lam_few(cmdv);
+	unlink(tmpnam);
+
+	if (r) {
+		errno = r;
+		if (errno != EUNKNOWN) {
+			terror("wipe");
+		} else
+		  show_help(NULL, "unknown", NULL);
 	} else {
-	  DBUG("wipe did NOT complete successfully\n");
+		DBUG("wipe completed successfully\n");
 	}
 
 	argvfree(cmdv);
@@ -308,6 +293,7 @@
  */
 	if (cmdc == 2) {
 		bhost = cmdv[1];
+	} else if ((bhost = getenv("PBS_NODEFILE"))) {
 	} else if ((bhost = getenv("LAMBHOST"))) {
 	} else if ((bhost = getenv("TROLLIUSBHOST"))) {
 	} else {

