LAM/MPI support for Mpiexec (experimental)
Ben Webb
ben at bellatrix.pcl.ox.ac.uk
Thu May 16 14:46:20 EDT 2002
The attached patch to LAM 6.5.6 makes it work with Mpiexec on a
PBS cluster. I took this route because, while I think it would be
possible to have mpiexec do the lamboot / mpirun / lamhalt sequence
itself, you would lose the extra functionality that LAM's own tools give
you. (But I see no reason why this support could not be added to
mpiexec's -comm=lam in the future.)
Basically, to use LAM at present with rsh/ssh/etc. you first run
"lamboot", which sets up a lamd daemon on each node. Then you use mpirun
to run your jobs, and this talks to the lamd daemons. When you're done,
you use "lamhalt" to shut down the lamd's. Both mpirun and lamhalt use the
network of lamd daemons to do their business, so they need neither
rsh/ssh nor Mpiexec; it's only lamboot that requires modification.
Currently, lamboot does the following:-
- Set up a listening TCP socket on node 0
- rsh to each node:
  - run "hboot" on it, telling it the hostname of node 0, and the
    listening port number
  - hboot in turn spawns the "lamd" daemon, which sets up another TCP
    listening socket, and connects back to node 0
  - lamboot then accepts the connection from lamd, and receives the
    port number of lamd's listening socket
- Once all nodes have been contacted, lamboot's TCP socket is closed
- lamboot contacts each lamd via the port received earlier, and tells
  each one the numbers of the listening ports on every other node
- lamboot's job is now done; the lamd daemons now have full
  connectivity, and take over from here.
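
As a rough illustration of the first step in that list, a listening TCP
socket on a kernel-assigned port can be opened as below. This is only a
sketch using plain BSD sockets; it is not LAM's own
sfh_sock_open_srv_inet_stm(), and the name open_listener is made up for
the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a TCP listening socket on a port chosen by the kernel.
 * On success, returns the socket descriptor and stores the port
 * number in *port; on failure, returns -1.
 * (Hypothetical helper for illustration, not part of LAM.)
 */
int open_listener(int *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int sd;

    assert(port != NULL);

    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);   /* 0 = let the kernel pick a free port */

    if (bind(sd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(sd, 5) < 0 ||
        getsockname(sd, (struct sockaddr *) &addr, &len) < 0) {
        close(sd);
        return -1;
    }

    /* Report the port actually assigned, so it can be sent to hboot. */
    *port = ntohs(addr.sin_port);
    return sd;
}
```

The port returned here is what would be handed to hboot on each node so
that lamd knows where to connect back to.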
Essentially, my patch changes this behaviour to:-
- Set up N listening sockets, one for each node in the cluster
- Create an mpiexec configuration file with the necessary hboot commands
  for each node
- Fork and run mpiexec in the background, passing it the configuration file
- Accept connections from each lamd, and receive a port number from each
- Contact each lamd in the same way as before.
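
The configuration file mentioned above holds one "host : command" line
per node. The sketch below formats one such line, modelled loosely on the
fprintf() calls in the attached patch. It is simplified: the optional
-d, -v, -s, -b and -x flags are left out, and format_config_line and its
parameters are names invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format one mpiexec config line of the form
 *   host : hboot -t -c lam-conf.lam -I "-H ip -P port -n node -o origin"
 * into buf. Returns 0 on success, -1 if the line did not fit.
 * (Hypothetical helper; the real patch writes straight to a FILE *.)
 */
int format_config_line(char *buf, size_t len, const char *host,
                       const char *hboot, const char *origin_ip,
                       int port, int node, int origin)
{
    int n;

    assert(buf != NULL && host != NULL);
    assert(hboot != NULL && origin_ip != NULL);

    n = snprintf(buf, len,
                 "%s : %s -t -c lam-conf.lam "
                 "-I \"-H %s -P %d -n %d -o %d\"\n",
                 host, hboot, origin_ip, port, node, origin);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

Each node gets its own port number here, which is why the patch opens one
listening socket per node rather than a single shared one.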
I have hacked hboot and lamd such that they do not daemonise, so
the spawned mpiexec lasts for the duration of the job. Once lamboot has
completed, lamnodes, mpirun, etc. should work as normal. When the
job completes, PBS will kill the mpiexec process and thus the spawned
lamd's, although you can also do this the "proper" way by running
lamhalt, which will kill every lamd and thus prompt mpiexec to exit.
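
The reason daemonising matters is that a process which calls setsid()
leaves its parent's session, and (per the comment in the hboot.c hunk of
the patch) TM relies on everything staying in one session to be able to
kill it. The following sketch only demonstrates that session behaviour;
child_stays_in_session is a hypothetical name and none of this is LAM or
PBS code.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child and, if detach is 1, have it call setsid() the way a
 * daemonising process would. Returns 1 if the child is still in the
 * parent's session, 0 if it moved to a new one, -1 on error.
 */
int child_stays_in_session(int detach)
{
    pid_t parent_sid = getsid(0);
    pid_t pid;
    int status;

    assert(detach == 0 || detach == 1);

    if (parent_sid == (pid_t) -1)
        return -1;

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: optionally start a new session, then report back. */
        if (detach && setsid() == (pid_t) -1)
            _exit(2);
        _exit(getsid(0) == parent_sid ? 0 : 1);
    }

    /* Parent: wait for the child, as the patched hboot does with wait(). */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;   /* same session */
    case 1:
        return 0;   /* new session */
    default:
        return -1;
    }
}
```

Calling child_stays_in_session(0) corresponds to the patched,
non-daemonising behaviour, while child_stays_in_session(1) corresponds to
the original one.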
"wipe" is very simple; it just runs "tkill" on each node to kill
off the lamd process. I don't think you should ever need to do this, as
just killing mpiexec should kill the spawned lamd's anyway. The patch
does include code to use mpiexec to run the tkill commands, but it won't
actually work in practice because MOM won't let mpiexec connect to TM
twice (and it'll already be connected once, for the lamboot call).
I haven't touched recon, lamgrow, or lamshrink, so these won't
work. I don't think they'd be too difficult to fix though, if people
really really wanted them.
I've hacked lamboot so that the default boot schema is the PBS
nodefile, so running a LAM/MPI job via PBS and Mpiexec should be as
simple as putting the following in a PBS script:-

    lamboot
    mpirun C /path/to/mpi/binary
    lamhalt
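
The lookup order behind that default can be sketched as follows, based on
the lamboot.c hunk in the attached patch: an explicit argument wins, then
PBS_NODEFILE, then LAM's existing environment variables. The function
name pick_boot_schema is invented for this example, and returning NULL
stands in for whatever built-in default lamboot falls back to.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Choose the boot schema file name.
 * Order: command-line argument, $PBS_NODEFILE, $LAMBHOST,
 * $TROLLIUSBHOST. Returns NULL if none is set, leaving the caller to
 * apply its own default. (Hypothetical helper, not LAM's code.)
 */
const char *pick_boot_schema(int argc, char **argv)
{
    const char *fname;

    assert(argv != NULL);

    if (argc == 2)
        return argv[1];
    if ((fname = getenv("PBS_NODEFILE")) != NULL)
        return fname;
    if ((fname = getenv("LAMBHOST")) != NULL)
        return fname;
    if ((fname = getenv("TROLLIUSBHOST")) != NULL)
        return fname;
    return NULL;
}
```

Because PBS sets PBS_NODEFILE inside a job, a bare "lamboot" in the job
script picks up the nodes PBS allocated.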
CAVEATS:
- This patch is not 100% perfect yet. Obviously.
- Processes started via LAM's mpirun don't record their CPU usage, etc.
  with PBS, nor do they get killed if you "qdel" the PBS job.
  I'm not entirely sure why this is, as the processes are children of
  the TM-spawned lamd process. I think lamd must be calling setsid()
  somewhere. I will investigate further.
- My error handling isn't very robust, so if mpiexec isn't installed, or
  you feed it garbage, bad things will happen.
- Mpiexec reports I/O errors after startup from lamboot. I think this is
  because it inherits lamboot's stdin etc., and am pretty sure that just
  closing these descriptors will solve this problem.
- It'd be nicer if Mpiexec could read a configuration file from stdin,
  so that I didn't have to mess around with temporary files. How doable
  is this?
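
For the stdin/stdout caveat above, one possible approach is to detach the
child's standard descriptors between fork() and exec. The sketch below
points them at /dev/null instead of closing them outright, so that
descriptors 0 to 2 are not reused by later open() calls. That is a
substitution of mine, and whether it actually silences Mpiexec's I/O
errors is untested. All function names here are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Point stdin, stdout and stderr at /dev/null. Returns 0 or -1. */
static int detach_stdio(void)
{
    int fd = open("/dev/null", O_RDWR);
    int i;

    if (fd < 0)
        return -1;
    for (i = 0; i <= 2; i++) {
        if (fd != i && dup2(fd, i) < 0)
            return -1;
    }
    if (fd > 2)
        close(fd);
    return 0;
}

/*
 * Fork and exec argv[0] with its stdio detached from the parent's.
 * Returns the child's pid, or -1 if fork() failed. The child exits
 * with status 127 if the redirect or the exec fails.
 */
pid_t spawn_detached(char *const argv[])
{
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid == 0) {
        if (detach_stdio() == 0)
            execv(argv[0], argv);
        _exit(127);
    }
    return pid;
}

/* Wait for pid and return its exit status, or -1 on abnormal exit. */
int wait_for_exit(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In lamboot the parent would not wait for the child straight away, since
mpiexec has to keep running for the duration of the job; wait_for_exit is
included only so the sketch can be exercised on its own.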
Any suggestions for improvements to this patch, or comments, welcomed...
Ben
--
ben at bellatrix.pcl.ox.ac.uk http://bellatrix.pcl.ox.ac.uk/~ben/
"A low voter turnout is an indication of fewer people going to
the polls."
- Vice President Dan Quayle
-------------- next part --------------
diff -Nur -Xlam.exclude lam-6.5.6/share/boot/lambootagent.c lam-6.5.6-patched/share/boot/lambootagent.c
--- lam-6.5.6/share/boot/lambootagent.c Mon Nov 19 16:13:45 2001
+++ lam-6.5.6-patched/share/boot/lambootagent.c Thu May 16 16:21:31 2002
@@ -73,8 +73,8 @@
int
lambootagent(struct lamnode *lamnet, int nlamnet, int *nboot, int *nrun)
{
- int agent_port; /* port number for replies */
- int agent_sd; /* socket for replies */
+ int agent_port[nlamnet]; /* port number for replies */
+ int agent_sd[nlamnet]; /* socket for replies */
int boot_sd; /* connection to new node */
int cmdc; /* command vector count */
int dlport;
@@ -84,7 +84,12 @@
int4 origin; /* origin node ID */
char **cmdv; /* command vector */
char *batchid; /* batch job ID */
+ char *mpiexec[10]; /* argv for mpiexec invocation */
+ char tmpnam[80];
+ int tmpfd;
+ FILE *fp;
unsigned char *p;
+ pid_t childpid;
*nboot = 0;
*nrun = 0;
@@ -99,22 +104,43 @@
fl_verbose = opt_taken('v');
fl_fast = opt_taken('b');
fl_close = opt_taken('s');
+
/*
- * Allocate a server socket and port.
+ * Write mpiexec config file.
*/
- agent_port = 0;
- agent_sd = sfh_sock_open_srv_inet_stm(&agent_port);
- if (agent_sd < 0) {
- show_help("boot", "socket-fail", NULL);
- return(LAMERROR);
+ strcpy(tmpnam, "/tmp/lam-mpiexec.cfg-XXXXXX");
+ tmpfd = mkstemp(tmpnam);
+ if (tmpfd == -1) {
+ perror("Create temporary file failed");
+ exit(1);
+ }
+ fp = fdopen(tmpfd, "w");
+ if (!fp) {
+ perror("Open of temp file failed");
+ exit(1);
+ }
+ if (fl_verbose) {
+ printf("Using mpiexec config file %s\n", tmpnam);
}
+
/*
- * Make the socket close on exec.
+ * Allocate server sockets and ports.
*/
- if (fcntl(agent_sd, F_SETFD, 1) == -1) {
- show_help(NULL, "system-call-fail", "fcntl (set close-on-exec)",
- NULL);
- return(LAMERROR);
+ for (i = 0; i < nlamnet; i++) {
+ agent_port[i] = 0;
+ agent_sd[i] = sfh_sock_open_srv_inet_stm(&agent_port[i]);
+ if (agent_sd[i] < 0) {
+ show_help("boot", "socket-fail", NULL);
+ return(LAMERROR);
+ }
+/*
+ * Make the sockets close on exec.
+ */
+ if (fcntl(agent_sd[i], F_SETFD, 1) == -1) {
+ show_help(NULL, "system-call-fail", "fcntl (set close-on-exec)",
+ NULL);
+ return(LAMERROR);
+ }
}
/*
* Find the local node.
@@ -160,18 +186,14 @@
/*
* Invoke hboot on the new host.
*/
- cmdc = 0;
- cmdv = 0;
- argvadd(&cmdc, &cmdv, DEFTHBOOT);
- argvadd(&cmdc, &cmdv, "-t");
- argvadd(&cmdc, &cmdv, "-c");
- argvadd(&cmdc, &cmdv, "lam-conf.lam");
+ fprintf(fp, "%s : %s -t -c lam-conf.lam", lamnet[i].lnd_hname,
+ DEFTHBOOT);
if (fl_debug) {
- argvadd(&cmdc, &cmdv, "-d");
+ fprintf(fp, " -d");
}
if (fl_verbose) {
- argvadd(&cmdc, &cmdv, "-v");
+ fprintf(fp, " -v");
}
/*
* If remote node, close stdio of processes, unless forced by the
@@ -180,7 +202,7 @@
* hboot/lamd on somenode to close their stdio so that rsh can finish.
*/
if (i != local || fl_close) {
- argvadd(&cmdc, &cmdv, "-s");
+ fprintf(fp, " -s");
}
/*
* If this is under a batch system, pass the -b to both hboot and to
@@ -188,26 +210,22 @@
*/
batchid = get_batchid();
if (strlen(batchid) > 0) {
- argvadd(&cmdc, &cmdv, "-b");
- argvadd(&cmdc, &cmdv, batchid);
+ fprintf(fp, " -b %s", batchid);
}
/*
* Override the $inet_topo variable.
*/
p = (unsigned char *) &lamnet[local].lnd_addr.sin_addr;
- argvadd(&cmdc, &cmdv, "-I");
- sprintf(buf, "%c%s-H %u.%u.%u.%u -P %d -n %d -o %d %s %s%c",
- i == local ? ' ' : '"',
+ fprintf(fp, " -I \" %s-H %u.%u.%u.%u -P %d -n %d -o %d %s %s\"",
opt_taken('x') ? "-x " : "",
(unsigned) p[0], (unsigned) p[1],
(unsigned) p[2], (unsigned) p[3],
- agent_port,
+ agent_port[i],
i,
origin,
(strlen(batchid) == 0 ? " " : "-b"),
- (strlen(batchid) == 0 ? " " : batchid),
- i == local ? ' ' : '"');
- argvadd(&cmdc, &cmdv, buf);
+ (strlen(batchid) == 0 ? " " : batchid));
+ fprintf(fp, "\n");
VERBOSE("Executing %s on n%d (%s - %d CPU%s)...\n",
DEFTHBOOT, i, lamnet[i].lnd_hname,
@@ -215,48 +233,35 @@
(lamnet[i].lnd_ncpus > 1) ? "s" : "");
(*nboot)++;
+ }
- if (i == local) {
- if (fl_debug) {
- int j;
-
- printf("lamboot: attempting to execute \"");
- for (j = 0; j < cmdc; j++) {
- if (j > 0)
- printf(" ");
- if (strchr(cmdv[j], ' ') != NULL)
- printf("\"%s\"", cmdv[j]);
- else
- printf("%s", cmdv[j]);
- }
- printf("\"\n");
- }
- r = _lam_few(cmdv);
-
- if (r) {
- (*nboot)--;
- errno = r;
- show_help("boot", "fork-fail", cmdv[0], NULL);
- argvfree(cmdv);
- return(LAMERROR);
- }
- } else {
- r = inetexec(lamnet[i].lnd_hname, lamnet[i].lnd_uname,
- cmdv, (fl_debug ? "lamboot" : NULL),
- fl_fast);
-
- if (r) {
- (*nboot)--;
- argvfree(cmdv);
- /* inetexec will display errors if it
- fails */
- return(LAMERROR);
- }
- }
+/*
+ * Fire off mpiexec to start the hboot processes.
+ */
+ fclose(fp);
+ mpiexec[0] = "/usr/bin/mpiexec";
+ mpiexec[1] = "-comm=none";
+ mpiexec[2] = "-config";
+ mpiexec[3] = tmpnam;
+ mpiexec[4] = NULL;
+ childpid = fork();
+ if (childpid == -1) {
+ lamfail("lambootagent fork failed");
+ } else if (childpid == 0) {
+ execv(mpiexec[0], mpiexec);
+ lamfail("execv failed");
+ }
+
+ for (i = 0; i < nlamnet; ++i) {
+/*
+ * Skip nodes that are invalid or already booted.
+ */
+ if ((lamnet[i].lnd_nodeid == NOTNODEID) ||
+ !(lamnet[i].lnd_type & NT_BOOT)) continue;
/*
* Accept a connection from the new host.
*/
- boot_sd = sfh_sock_accept_tmout(agent_sd, LAM_TO_BOOT);
+ boot_sd = sfh_sock_accept_tmout(agent_sd[i], LAM_TO_BOOT);
if (boot_sd < 0) return(LAMERROR);
/*
* Read the new host port numbers.
@@ -272,7 +277,17 @@
(*nrun)++;
}
- if (close(agent_sd)) return(LAMERROR);
+ if (fl_verbose) {
+ printf("all nodes connected\n");
+ }
+/*
+ * mpiexec must have fired up by now, so we can remove the config file
+ */
+ unlink(tmpnam);
+
+ for (i = 0; i < nlamnet; ++i) {
+ if (close(agent_sd[i])) return(LAMERROR);
+ }
if (fl_verbose) {
nodespin_init("topology");
diff -Nur -Xlam.exclude lam-6.5.6/tools/hboot/hboot.c lam-6.5.6-patched/tools/hboot/hboot.c
--- lam-6.5.6/tools/hboot/hboot.c Mon Nov 19 16:14:48 2001
+++ lam-6.5.6-patched/tools/hboot/hboot.c Thu May 16 16:15:52 2002
@@ -99,6 +99,8 @@
char buf[32]; /* formatting buffer */
char *full; /* full pathname */
char *tail; /* tail of full pathname */
+ char **pt;
+ int status;
/* Ensure that we are not root */
@@ -245,7 +247,7 @@
exit(errno);
}
-#if 1
+#if 0
/* Comment this out to make the TM extensions to PBS work
nicely -- everything will be in one session, so TM can kill
it when it dies. */
@@ -304,6 +306,7 @@
if (fl_debug) {
printf("hboot: attempting to execute \n");
}
+
execvp(p->psc_argv[0], p->psc_argv);
exit(errno);
}
@@ -323,6 +326,7 @@
printf("\n");
}
+ wait(&status);
}
if (p->psc_delay > 0) {
diff -Nur -Xlam.exclude lam-6.5.6/tools/lamboot/lamboot.c lam-6.5.6-patched/tools/lamboot/lamboot.c
--- lam-6.5.6/tools/lamboot/lamboot.c Mon Nov 19 16:14:49 2001
+++ lam-6.5.6-patched/tools/lamboot/lamboot.c Thu May 16 13:13:25 2002
@@ -271,6 +271,7 @@
*/
if (cmdc == 2) {
fname = cmdv[1];
+ } else if ((fname = getenv("PBS_NODEFILE"))) {
} else if ((fname = getenv("LAMBHOST"))) {
} else if ((fname = getenv("TROLLIUSBHOST"))) {
} else {
diff -Nur -Xlam.exclude lam-6.5.6/tools/wipe/wipe.c lam-6.5.6-patched/tools/wipe/wipe.c
--- lam-6.5.6/tools/wipe/wipe.c Mon Nov 19 16:14:50 2001
+++ lam-6.5.6-patched/tools/wipe/wipe.c Thu May 16 15:54:46 2002
@@ -86,6 +86,8 @@
int badhost; /* bad host index */
int r, j, success = 1;
struct lamnode *lamnet; /* network description array */
+ char tmpnam[80];
+ int tmpfd;
/* Ensure that we are not root */
@@ -192,15 +194,23 @@
} else {
DBUG("wipe: killing LAM from a non-member machine\n");
}
+
/*
- * Build the tkill command.
+ * Write mpiexec config file.
*/
- cmdn = 0;
- cmdv = 0;
- argvadd(&cmdn, &cmdv, DEFTTKILL);
-
- if (fl_debug) {
- argvadd(&cmdn, &cmdv, "-d");
+ strcpy(tmpnam, "/tmp/lam-mpiexec.cfg-XXXXXX");
+ tmpfd = mkstemp(tmpnam);
+ if (tmpfd == -1) {
+ perror("Create temporary file failed");
+ exit(1);
+ }
+ fp = fdopen(tmpfd, "w");
+ if (!fp) {
+ perror("Open of temp file failed");
+ exit(1);
+ }
+ if (fl_verbose) {
+ printf("Using mpiexec config file %s\n", tmpnam);
}
if (opt_taken('n')) {
@@ -208,71 +218,46 @@
} else {
limit = -1;
}
+
+ for (i = 0; (i < nlamnet) && limit; ++i) {
+ if (limit > 0) --limit;
+ fprintf(fp, "%s", lamnet[i].lnd_hname);
+ }
+ fprintf(fp, " : %s", DEFTTKILL);
+ if (fl_debug) {
+ fprintf(fp, " -d");
+ }
+
/*
* If we're running ounder a batch system, we have to propogate the
* socket name to all the remote tkill instances.
*/
batchid = get_batchid();
if (strlen(batchid) > 0) {
- argvadd(&cmdn, &cmdv, "-b");
- argvadd(&cmdn, &cmdv, batchid);
+ fprintf(fp, " -b %s", batchid);
}
+
/*
- * Loop over all host nodes.
+ * Build the mpiexec command.
*/
- global_ret = 0;
-
- for (i = 0; (i < nlamnet) && limit; ++i) {
-
- if (limit > 0) --limit;
-
- VERBOSE("Executing %s on n%d (%s)...\n", DEFTTKILL,
- lamnet[i].lnd_nodeid, lamnet[i].lnd_hname);
-
- if (fl_debug) {
- printf("wipe: attempting to launch \"");
- for (j = 0; j < cmdn; j++) {
- if (j > 0)
- printf(" ");
- printf("%s", cmdv[j]);
- }
- printf("\" ");
- }
-
- if (lamnet[i].lnd_type & NT_ORIGIN) {
- DBUG("(local execution)\n");
- r = _lam_few(cmdv);
-
- if (r) {
- errno = r;
- }
- } else {
- DBUG("(remote execution)\n");
- r = inetexec(lamnet[i].lnd_hname,
- lamnet[i].lnd_uname, cmdv,
- (fl_debug ? "wipe" : NULL),
- fl_fast);
- }
-
- if (r) {
- fprintf(stderr, "wipe: %s failed on n%d (%s)\n",
- DEFTTKILL, lamnet[i].lnd_nodeid,
- lamnet[i].lnd_hname);
-
- if (errno != EUNKNOWN) {
- terror("wipe");
- } else
- show_help(NULL, "unknown", NULL);
-
- global_ret = errno;
- success = 0;
- }
- }
-
- if (success) {
- DBUG("wipe completed successfully\n");
+ cmdn = 0;
+ cmdv = 0;
+ argvadd(&cmdn, &cmdv, "/usr/bin/mpiexec");
+ argvadd(&cmdn, &cmdv, "-comm=none");
+ argvadd(&cmdn, &cmdv, "-config");
+ argvadd(&cmdn, &cmdv, tmpnam);
+
+ r = _lam_few(cmdv);
+ unlink(tmpnam);
+
+ if (r) {
+ errno = r;
+ if (errno != EUNKNOWN) {
+ terror("wipe");
+ } else
+ show_help(NULL, "unknown", NULL);
} else {
- DBUG("wipe did NOT complete successfully\n");
+ DBUG("wipe completed successfully\n");
}
argvfree(cmdv);
@@ -308,6 +293,7 @@
*/
if (cmdc == 2) {
bhost = cmdv[1];
+ } else if ((bhost = getenv("PBS_NODEFILE"))) {
} else if ((bhost = getenv("LAMBHOST"))) {
} else if ((bhost = getenv("TROLLIUSBHOST"))) {
} else {