Patch for TotalView support with MPICH2/MVAPICH2

Pete Wyckoff pw at osc.edu
Fri Mar 21 13:44:00 EDT 2008


frank.mietke at informatik.tu-chemnitz.de wrote on Thu, 20 Mar 2008 14:23 +0100:
> here's my patch for adding TotalView support for MPICH2/MVAPICH2. I've tested it
> on our system and it worked well. To work properly it is recommended but not
> necessary to set the following environment variables:
> 
> TOTALVIEW=<totalview executable>
> TVDSVR=<ssh|rsh>
> 
> MPICH2/MVAPICH2 have to be configured with --enable-debuginfo.

Great stuff.  I couldn't keep from meddling with it though.  Below
is the patch that I'm currently thinking about.  (You'll get a small
reject in Makefile.in unless you pull SVN first.)  It could go in
now, but I thought you might like to look this over, fix my bugs,
and maybe make some improvements.

Some notes:

tvready is not used, so I removed it.  Not sure if this is
important.

I moved more of the totalview work into tv_attach.c, hopefully
without breaking it, to keep start_tasks a bit cleaner---it is
getting big.

Does this:

> +		growstr_printf(g, "printf \"%%10d\" $$ > /dev/tcp/%s/%d; if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",

work on all sorts of machines?  Just Linux?  I've always been
baffled at how "ls /dev/tcp" shows nothing, but this magic happens.
It is implemented in glibc, right?  An alternate way would be to use
the "nc" program, but I don't have a preference for that.
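
If /dev/tcp turns out not to be portable (I believe it is a bash
feature rather than something the kernel or glibc provides), a tiny
helper could do the same job.  Here is a rough sketch, not part of
the patch: the names format_pid_field() and send_pid_field() are
made up for illustration, and it assumes the same 10-character pid
field that the printf in the patch produces.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define PID_FIELD_LEN 10  /* same width as "printf %10d" in the patch */

/*
 * Render pid as a right-aligned, space-padded 10-character decimal field.
 * Returns 0 on success, -1 if it does not fit in exactly 10 characters
 * or the buffer is too small.  (Hypothetical helper, not in the patch.)
 */
int format_pid_field(char *buf, size_t buflen, long pid)
{
    int n = snprintf(buf, buflen, "%10ld", pid);

    return (n == PID_FIELD_LEN && (size_t) n < buflen) ? 0 : -1;
}

/*
 * Connect to host:port over TCP and write the pid field.
 * Returns 0 on success, -1 on any failure.  (Hypothetical helper.)
 */
int send_pid_field(const char *host, int port, long pid)
{
    char field[PID_FIELD_LEN + 1], service[16];
    struct addrinfo hints, *res, *ai;
    size_t off = 0;
    int fd = -1;

    if (format_pid_field(field, sizeof(field), pid) < 0)
        return -1;
    snprintf(service, sizeof(service), "%d", port);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    /* try each returned address until one accepts the connection */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    if (fd < 0)
        return -1;

    /* write all 10 bytes, allowing for short writes */
    while (off < PID_FIELD_LEN) {
        ssize_t w = write(fd, field + off, PID_FIELD_LEN - off);
        if (w <= 0)
            break;
        off += (size_t) w;
    }
    close(fd);
    return off == PID_FIELD_LEN ? 0 : -1;
}
```

The shell would still have to pass "$$" to such a helper as an
argument, since the pid that matters is the one that survives the
later exec, not the helper's own.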

Are both these:

> +	env_add_int("PMI_TOTALVIEW", 1);
> +	env_add_int("MPIEXEC_DEBUG", 1);

needed by mpich2?

I made these strings const:

> +typedef struct {
> +	char *host_name;
> +	char *executable_name;
> +	int	pid;
> +} MPIR_PROCDESC;

so we don't have to strdup() quite so much.  Hopefully totalview is
okay with this and doesn't try to change them.

Would you be willing to attempt some documentation?  I don't know if
it should go in README or mpiexec.1 or just at the top of
tv_attach.c.  But I'm a bit confused by the model.  Old totalview
support used to be solely in the app (MPI library) itself.  All
mpiexec had to do was to invoke totalview with some args.  This new
support spawns off a totalview to attach to mpiexec itself!  The
info from your mail about how to set things up should go in the docs too.

Do we need to worry about killing off totalview ever?

What happens at MPI_Abort()?  Does everything go away cleanly?

The one-at-a-time spawn-then-attach model will be very slow for
large values of "-np".  I'm not going to insist that that process be
parallelized, although maybe someone will be bothered enough to do
it someday.
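
One possible direction, sketched under assumptions: if each task
reported its task index along with its pid, mpiexec could spawn
everything first and then drain the connections in whatever order
they arrive.  The two-field message format and the names below
(tv_format_msg, tv_record_msg, struct tv_entry) are hypothetical;
the patch itself sends only the pid and relies on arrival order.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TV_FIELD_LEN 10                 /* width of each decimal field */
#define TV_MSG_LEN   (2 * TV_FIELD_LEN) /* "<task index><pid>" */

/* One slot per task; filled is set once that task has reported in. */
struct tv_entry {
    int pid;
    int filled;
};

/*
 * Build the hypothetical 20-character message for one task.
 * Returns 0 on success, -1 if the fields do not fit.
 */
int tv_format_msg(char *buf, size_t buflen, int task, int pid)
{
    int n = snprintf(buf, buflen, "%10d%10d", task, pid);

    return (n == TV_MSG_LEN && (size_t) n < buflen) ? 0 : -1;
}

/*
 * Parse one 20-character message and store the pid in the slot for
 * its task.  Returns 0 on success, -1 on a bad index, a bad pid, or
 * a duplicate report.
 */
int tv_record_msg(const char *msg, struct tv_entry *table, int ntasks)
{
    char field[TV_FIELD_LEN + 1];
    int task, pid;

    memcpy(field, msg, TV_FIELD_LEN);
    field[TV_FIELD_LEN] = '\0';
    task = atoi(field);

    memcpy(field, msg + TV_FIELD_LEN, TV_FIELD_LEN);
    field[TV_FIELD_LEN] = '\0';
    pid = atoi(field);

    if (task < 0 || task >= ntasks || pid <= 0 || table[task].filled)
        return -1;
    table[task].pid = pid;
    table[task].filled = 1;
    return 0;
}
```

With something like this the accept loop would no longer need to
run in lockstep with the spawns, at the cost of a slightly longer
shell command on each node.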

I take it that MPI_Comm_spawn() will not work with totalview?  This
seems like a fundamental limitation in their interface.  Maybe we
should just document that in tv_attach.c.

		-- Pete


Index: mpiexec.h
===================================================================
--- mpiexec.h	(revision 418)
+++ mpiexec.h	(working copy)
@@ -186,6 +186,7 @@ extern int numtasks;  /* actual number o
 extern int numspawns;  /* 1 + number of times MPI_Spawn called */
 extern struct passwd *pswd;  /* used for home dir, shell, user name */
 extern struct sockaddr_in myaddr;  /* for out-of-band MPI lib startup */
+extern char *tvname;  /* name of totalview executable, possibly from env var */
 
 /* concurrent.c */
 extern int concurrent_master;  /* if first mpiexec to run in job */
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 418)
+++ Makefile.in	(working copy)
@@ -7,8 +7,8 @@
 #
 SRC   = mpiexec.c get_hosts.c start_tasks.c task.c event.c util.c config.c \
 	stdio.c growstr.c pmi.c gm.c ib.c psm.c p4.c rai.c concurrent.c \
-	exedist.c spawn.c
-H     = mpiexec.h util.h growstr.h list.h
+	exedist.c spawn.c tv_attach.c
+H     = mpiexec.h util.h growstr.h list.h tv_attach.h
 OTHER = ChangeLog LICENSE README mpiexec.1 proc-relations.fig \
 	hello.c hellof.f hellomp.f redir-helper.c \
 	runtests.pl README.lam
Index: start_tasks.c
===================================================================
--- start_tasks.c	(revision 418)
+++ start_tasks.c	(working copy)
@@ -22,6 +22,7 @@
 #include <netdb.h>  /* gethostbyname for portals */
 
 #include "mpiexec.h"
+#include "tv_attach.h"
 
 #ifdef HAVE_PATH_H
 #  include <paths.h>
@@ -296,6 +297,7 @@ start_tasks(int spawn)
     int task_start, task_end;
     const char *mpiexec_redir_helper_path;
     char *psm_uuid = NULL;
+    int tv_port = 0;
 
     /* for looping from 0..numtasks in the case of MPI_Spawn */
     task_start = spawns[spawn].task_start;
@@ -555,6 +557,19 @@ start_tasks(int spawn)
     env_add_if_not("PATH", _PATH_DEFPATH);
     env_add_if_not("USER", pswd->pw_name);
 
+
+    /*
+     * Set up for totalview attach.  Returns local port number that will be
+     * used in command startup to tell processes how to find us.
+     *
+     * XXX: This does not play well with MPI_Comm_spawn.
+     */
+    if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI) {
+	env_add_int("PMI_TOTALVIEW", 1);
+	env_add_int("MPIEXEC_DEBUG", 1);
+	tv_port = tv_startup(task_end - task_start);
+    }
+
     /*
      * Spawn each task, adding its private env vars.
      * numspawned set to zero earlier before signal handler setup;
@@ -637,8 +652,24 @@ start_tasks(int spawn)
 	/* build proc-specific command line */
 	growstr_zero(g);
 	g->translate_single_quote = 0;
-	growstr_printf(g, "if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
-	  pwd, pwd, user_shell);
+
+	/*
+	 * Totalview is a bit odd, even hackish perhaps.  Send the pid of
+	 * the just-starting process to ourselves via /dev/tcp, some sort
+	 * of virtual device that makes a TCP connection as told and sends
+	 * the echoed data.
+	 *
+	 * XXX: this works on what systems exactly?
+	 */
+	if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+	    growstr_printf(g, "printf %%10d $$ > /dev/tcp/%s/%d; "
+			   "if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
+			   nodes[0].name, tv_port,
+			   pwd, pwd, user_shell);
+	else
+	    growstr_printf(g,
+			   "if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
+			   pwd, pwd, user_shell);
 	growstr_append(g, "'exec ");
 	g->translate_single_quote = 1;
 
@@ -654,12 +685,13 @@ start_tasks(int spawn)
 	    growstr_printf(g, "%s ", mpiexec_redir_helper_path);
 
 	/*
-	 * The executable, or a debugger wrapper around it.
+	 * The executable, or a debugger wrapper around it.  In the mpich2
+	 * case we don't need any special args.
 	 */
-	if (cl_args->tview) {
+	if (cl_args->tview && cl_args->comm != COMM_MPICH2_PMI) {
 	    if (i == 0)
-		growstr_printf(g, "totalview %s -a -mpichtv",
-		  tasks[i].conf->exe);
+		growstr_printf(g, "%s %s -a -mpichtv", tvname,
+			       tasks[i].conf->exe);
 	    else
 		growstr_printf(g, "%s -mpichtv", tasks[i].conf->exe);
 	} else
@@ -786,8 +818,13 @@ start_tasks(int spawn)
 		    break;
 	    }
 	}
+	if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+	    tv_accept_one(i);
     }
 
+    if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+       tv_complete();
+
     /* don't need these anymore */
     free(nargv[0]);
     free(nargv[1]);
Index: pmi.c
===================================================================
--- pmi.c	(revision 418)
+++ pmi.c	(working copy)
@@ -263,10 +263,12 @@ accept_pmi_conn(fd_set *rfs)
 	error_errno("%s: response cmd=set debug=%d", __func__, mpi_task_debug);
     }
 
-    /*
-     * XXX: PMI_TOTALVIEW env var means we must send another little
-     * string; add it sometime.
-     */
+    if (cl_args->tview) {
+       growstr_zero(g);
+       growstr_printf(g, "cmd=tv_ready\n");
+       if (write_full(fd, g->s, g->len) < 0)
+	   error_errno("%s: response cmd=tv_ready", __func__);
+    }
 
     /*
      * PMII_getmaxes
Index: mpiexec.c
===================================================================
--- mpiexec.c	(revision 418)
+++ mpiexec.c	(working copy)
@@ -43,6 +43,7 @@ int numtasks;
 int numspawns;
 struct passwd *pswd;
 struct sockaddr_in myaddr;
+char *tvname;
 
 /*
  * Ensure it's executable.  Return true if so.
@@ -507,8 +508,22 @@ parse_args(int *argcp, const char ***arg
 	    if (*cr || l <= 0)
 		error("argument -n requires positive integral number of nodes");
 	    cl_args->numproc = l;
-	} else if (!strcmp(cp, "tv") || !strncmp(cp, "totalview", MAX(2,len)))
+	} else if (!strcmp(cp, "tv") || !strncmp(cp, "totalview", MAX(2,len))) {
+	    char *tvenv;
 	    cl_args->tview = 1;
+	    tvenv = getenv("TOTALVIEW");
+	    if (tvenv != NULL) {
+		if (access(tvenv, X_OK) == 0)
+		    tvname = strdup(tvenv);
+		else {
+		    warning("%s: TOTALVIEW env variable \"%s\" not executable, "
+			    "trying totalview in PATH\n", __func__, tvenv);
+		    tvenv = NULL;
+		}
+	    }
+	    if (tvenv == NULL)
+		tvname = strdup("totalview");
+	}
 	else if (!strncmp(cp, "config", MAX(3,len))) {
 	    cp += MAX(3,len);
 	    cl_args->config_file = find_optarg(cp, &argc, &argv, "config");
--- /dev/null	2007-12-11 14:36:49.554875183 -0500
+++ tv_attach.c	2008-03-21 12:35:14.000000000 -0400
@@ -0,0 +1,129 @@
+/*
+ * tv_attach.c - variables and routines for TotalView attachment
+ * 
+ * See: http://www-unix.mcs.anl.gov/mpi/mpi-debug/mpich-attach.txt
+ *
+ * Created: 02/2008 Frank Mietke <frank.mietke at s1998.tu-chemnitz.de>
+ *
+ * Distributed under the GNU Public License Version 2 or later (See LICENSE)
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>  /* nanosleep */
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include "util.h"
+#include "tv_attach.h"
+
+static int tv_socket;
+
+/*
+ * Totalview will read this structure directly from our address space,
+ * along with some other variables.
+ * 
+ * Changed the strings to const, hoping that it doesn't matter to totalview.
+ */
+typedef struct {
+    const char *host_name;
+    const char *executable_name;
+    int pid;
+} MPIR_PROCDESC;
+
+MPIR_PROCDESC *MPIR_proctable = NULL;
+int MPIR_proctable_size = 0;
+
+volatile int MPIR_debug_state = 0;
+volatile int MPIR_being_debugged = 0;
+int MPIR_i_am_starter = 0;
+int MPIR_partial_attach_ok = 0;
+
+/*
+ * TotalView sets a breakpoint on this function to learn that there is
+ * something new for the debugger to do.  Don't forget to set MPIR_debug_state
+ * to something useful before calling this function.
+ */
+void MPIR_Breakpoint(void)
+{
+}
+
+int tv_startup(int ntasks)
+{
+    int rc;
+    growstr_t *g;
+    struct sockaddr_in tv_sockaddr;
+    socklen_t tv_sockaddr_len = sizeof(tv_sockaddr);
+    struct timespec ts = { 0, 20L * 1000 * 1000 };  /* 20 ms */
+
+    MPIR_proctable = Malloc(ntasks * sizeof(*MPIR_proctable));
+
+    /* Creating socket for pid exchange of remote processes */
+    tv_socket = socket(PF_INET, SOCK_STREAM, 0);
+    if (tv_socket < 0)
+	error_errno("%s: socket ", __func__);
+
+    memset(&tv_sockaddr, 0, sizeof(tv_sockaddr));
+    tv_sockaddr.sin_family = AF_INET;
+    tv_sockaddr.sin_addr = myaddr.sin_addr;
+
+    rc = bind(tv_socket, (struct sockaddr *) &tv_sockaddr, tv_sockaddr_len);
+    if (rc)
+	error_errno("%s: bind", __func__);
+    rc = getsockname(tv_socket, (struct sockaddr *) &tv_sockaddr,
+		     &tv_sockaddr_len);
+    if (rc)
+	error_errno("%s: getsockname", __func__);
+    rc = listen(tv_socket, 32767);
+    if (rc)
+	error_errno("%s: listen", __func__);
+
+    /* start totalview and tell it to attach to this mpiexec process */
+    g = growstr_init();
+    growstr_printf(g, "%s -e \"dattach mpiexec %d; dgo; "
+		      "dassign MPIR_being_debugged 1\" &", tvname, getpid());
+    system(g->s);
+    growstr_free(g);
+
+    /* wait for totalview to find us */
+    while (!MPIR_being_debugged)
+	nanosleep(&ts, NULL);
+
+    return ntohs(tv_sockaddr.sin_port);
+}
+
+/*
+ * This is called once for each task startup.  Wait until the started
+ * task sends its pid, then enter that into the totalview table.  It
+ * might be better to do this in parallel with a service... handler
+ * like some other libraries have, but since this is only for the
+ * debugging case, maybe it is not too terrible to be slow.
+ */
+void tv_accept_one(int n)
+{
+    int pid, ps1;
+    char rpid[11];
+
+    ps1 = accept(tv_socket, 0, 0);
+    if (ps1 < 0)
+	error_errno("%s: accept totalview pid", __func__);
+    read_full(ps1, rpid, 10);
+    close(ps1);
+    rpid[10] = '\0';
+    pid = atoi(rpid);
+    MPIR_proctable[n].host_name = nodes[tasks[n].node].name;
+    MPIR_proctable[n].executable_name = tasks[n].conf->exe;
+    MPIR_proctable[n].pid = pid;
+    MPIR_proctable_size++;
+}
+
+void tv_complete(void)
+{
+    const int MPIR_DEBUG_SPAWNED = 1;
+
+    MPIR_debug_state = MPIR_DEBUG_SPAWNED;
+    MPIR_Breakpoint();
+    close(tv_socket);
+}
+
--- /dev/null	2007-12-11 14:36:49.554875183 -0500
+++ tv_attach.h	2008-03-21 12:38:08.000000000 -0400
@@ -0,0 +1,22 @@
+/*
+ * tv_attach.h - totalview PMI attachment header
+ * 
+ * See: http://www-unix.mcs.anl.gov/mpi/mpi-debug/mpich-attach.txt
+ *
+ * Created: 02/2008 Frank Mietke <frank.mietke at s1998.tu-chemnitz.de>
+ *
+ * Distributed under the GNU Public License Version 2 or later (See LICENSE)
+ */
+#ifndef __tv_attach_h
+#define __tv_attach_h
+
+/* Functions to be called by starter */
+int tv_startup(int ntasks);
+void tv_accept_one(int n);
+void tv_complete(void);
+
+/* Called by totalview */
+void MPIR_Breakpoint(void);
+
+#endif
+

