Patch for TotalView support with MPICH2/MVAPICH2

Frank Mietke frank.mietke at informatik.tu-chemnitz.de
Wed Apr 2 08:33:27 EDT 2008


Hi,

> > 
> > MPICH2/MVAPICH2 have to be configured with --enable-debuginfo.
> 
> Great stuff.  I couldn't keep from meddling with it though.  Below
Thanks for cleaning up my code.

> is the patch that I'm currently thinking about.  (You'll get a small
> reject in Makefile.in unless you pull SVN first.)  It could go in
> now, but I thought you might like to look this over, fix my bugs,
> and maybe make some improvements.
Attached is the revised patch. I made some minor changes:

	- added two header files to tv_attach.c so that all variables are declared
	- inserted a nanosleep() in MPIR_Breakpoint so that the compiler does not
	  optimize the function away. You could do something else there if you
	  prefer, or disable optimization when compiling tv_attach.c
	- nc + bash support, see below
	- README.tv file
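
As one possible alternative to the nanosleep() call, here is a minimal sketch that assumes GCC or a compatible compiler. It marks the function noinline and puts an empty asm statement in the body, which is the approach the GCC manual suggests for keeping calls to an otherwise empty function. TV_NOINLINE and tv_notify() are hypothetical names used only for illustration; they are not part of the patch.

```c
#include <assert.h>

#if defined(__GNUC__)
#  define TV_NOINLINE __attribute__((noinline))
#else
#  define TV_NOINLINE  /* other compilers: this sketch gives no guarantee */
#endif

volatile int MPIR_debug_state = 0;

/* The debugger plants a breakpoint here, so the call must not vanish. */
TV_NOINLINE void MPIR_Breakpoint(void)
{
#if defined(__GNUC__)
    /* Empty asm with a memory clobber: tells the compiler the function
     * has an effect it cannot see, so calls to it are kept. */
    __asm__ volatile ("" ::: "memory");
#endif
}

/* Hypothetical helper: publish a state, then hit the breakpoint. */
void tv_notify(int state)
{
    assert(state > 0);  /* 0 means "nothing to report" in this sketch */
    MPIR_debug_state = state;
    MPIR_Breakpoint();
}
```

Whether this is preferable to nanosleep() is a judgment call: it avoids the extra system call but ties the file to a compiler extension.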

> 
> tvready is not used, so I removed it.  Not sure if this is
> important.
No, not needed anymore. I had something different in mind when I defined tvready.

> 
> I moved more of the totalview work into tv_attach.c, hopefully
> without breaking it, to keep start_tasks a bit cleaner---it is
> getting big.
Great, thank you.


> 
> Does this:
> 
> > +		growstr_printf(g, "printf \"%%10d\" $$ > /dev/tcp/%s/%d; if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
> 
> work on all sorts of machines?  Just linux?  I've always been
> baffled at how "ls /dev/tcp" shows nothing, but this magic happens.
> It is implemented in glibc, right?  An alternate way would be to use
> the "nc" program, but I don't have a preference for that.
Communication through /dev/tcp is a feature of the shell (bash). The shell
itself opens a socket and writes to it.
Okay, two issues. On our systems nc isn't installed, and Debian/Ubuntu disable
this feature in their bash and suggest using nc instead. I changed the script
above so that it checks for nc first and, if nc is not there, falls back to the
bash way. I have added some remarks to the README.tv file.
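
The nc-first, bash-fallback idea can be sketched in isolation as below. This is only an illustration: build_pid_report_cmd() is a hypothetical helper that does not exist in mpiexec, and it uses "command -v" to look for nc in PATH, which is an assumption of this sketch rather than what the attached patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: write into buf a shell
 * fragment that sends the shell's pid to host:port as a 10-character
 * field, preferring nc and falling back to bash's /dev/tcp.
 * Returns 0 on success, -1 if buf is too small.
 */
int build_pid_report_cmd(char *buf, size_t len, const char *host, int port)
{
    int n;

    assert(buf != NULL && host != NULL);
    assert(strlen(host) > 0 && port > 0 && port < 65536);

    n = snprintf(buf, len,
                 "if command -v nc >/dev/null 2>&1; "
                 "then printf %%10d $$ | nc %s %d; "
                 "else printf %%10d $$ > /dev/tcp/%s/%d; fi; ",
                 host, port, host, port);

    /* snprintf reports the length it wanted; treat truncation as failure */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

The fixed width of 10 characters matters because the receiving side reads exactly that many bytes before converting them to a pid.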

> 
> Are both these:
> 
> > +	env_add_int("PMI_TOTALVIEW", 1);
This ensures that the tv_ready message is consumed by the processes.


> > +	env_add_int("MPIEXEC_DEBUG", 1);
MPI_Init checks this and waits until totalview has attached to all
processes.
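
Roughly, the waiting side could look like the sketch below. It is not taken from the MPICH2 sources: debug_gate and wait_for_debugger() are hypothetical names, and the poll limit exists only so that the example terminates.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() under strict -std modes */
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical gate variable; a debugger would set it to 1 once attached. */
volatile int debug_gate = 0;

/*
 * Wait for the debugger if MPIEXEC_DEBUG is set in the environment.
 * max_polls bounds the wait so this sketch terminates; a real library
 * would presumably wait without a limit.
 * Returns 1 if the process may continue, 0 if the poll limit was hit.
 */
int wait_for_debugger(int max_polls)
{
    struct timespec ts = { 0, 1000 * 1000 };  /* 1 ms between polls */
    int i;

    assert(max_polls >= 0);

    if (getenv("MPIEXEC_DEBUG") == NULL)
        return 1;  /* not started under the debugger, nothing to wait for */

    for (i = 0; i < max_polls && !debug_gate; i++)
        nanosleep(&ts, NULL);

    return debug_gate ? 1 : 0;
}
```

The gate is volatile because it is changed from outside the program, by the debugger writing into the process's memory.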

> 
> needed by mpich2?
> 
> I made these strings const:
> 
> > +typedef struct {
> > +	char *host_name;
> > +	char *executable_name;
> > +	int	pid;
> > +} MPIR_PROCDESC;
> 
> so we don't have to strdup() quite so much.  Hopefully totalview is
> okay with this and doesn't try to change them.
good.

> Would you be willing to attempt some documentation?  I don't know if
> it should go in README or mpiexec.1 or just at the top of
> tv_attach.c.  But I'm a bit confused by the model.  Old totalview
> support used to be solely in the app (MPI library) itself.  All
> mpiexec had to do was to invoke totalview with some args.  This new
> support spawns off a totalview to attach to mpiexec itself!  The
> info from your mail about how to set things up too.
See README.tv

> 
> Do we need to worry about killing off totalview ever?
If I send a "kill -9" to the totalview process, then all other processes hang
and must be killed the same way. With signal 15 on the totalview process, the
rest is cleaned up properly. If the job exceeds the walltime, everything is
cleaned up correctly as well. In principle, I think not, because totalview
takes control of all processes.


> 
> What happens at MPI_Abort()?  Does everything go away cleanly?
MPI_Abort() of MPICH2 doesn't work correctly with mpiexec. Only the affected
process exits. With TotalView you have a kill button for the remaining processes.

> 
> The one-at-a-time spawn-then-attach model will be very slow for
> large values of "-np".  I'm not going to insist that that process be
> parallelized, although maybe someone will be bothered enough to do
> it someday.
To prevent a misunderstanding: in the for-loop that spawns the processes, only
the pid, hostname and executable name are gathered. After this loop, the call to
tv_complete() initiates totalview's attachment step. But this step is
serialized too, because totalview does an rsh/ssh for every process.
Nevertheless, there is of course potential for parallelization.
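
The two phases can be sketched as below: first fill the table one task at a time, then publish it in a single step. The MPIR_* names follow the attached tv_attach.c; the tv_table_*() helpers and parse_pid_field() are hypothetical and only illustrate the flow, with strtol() swapped in for atoi() so that a garbled pid field is rejected.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

typedef struct {
    const char *host_name;
    const char *executable_name;
    int pid;
} MPIR_PROCDESC;

#define MPIR_DEBUG_SPAWNED 1

MPIR_PROCDESC *MPIR_proctable = NULL;
int MPIR_proctable_size = 0;
volatile int MPIR_debug_state = 0;
static int table_capacity = 0;

void MPIR_Breakpoint(void) { /* the debugger stops here */ }

/* Parse the 10-character, space-padded pid field; -1 if it is garbage. */
static int parse_pid_field(const char *field)
{
    char *end;
    long v = strtol(field, &end, 10);

    if (end == field || *end != '\0' || v <= 0 || v > INT_MAX)
        return -1;
    return (int) v;
}

int tv_table_init(int ntasks)
{
    assert(ntasks > 0);
    MPIR_proctable = calloc((size_t) ntasks, sizeof(*MPIR_proctable));
    table_capacity = MPIR_proctable ? ntasks : 0;
    return MPIR_proctable ? 0 : -1;
}

/* Phase 1: called once per spawned task, only records information. */
int tv_table_add(int n, const char *host, const char *exe, const char *field)
{
    int pid = parse_pid_field(field);

    if (n < 0 || n >= table_capacity || pid < 0)
        return -1;
    MPIR_proctable[n].host_name = host;
    MPIR_proctable[n].executable_name = exe;
    MPIR_proctable[n].pid = pid;
    if (n + 1 > MPIR_proctable_size)
        MPIR_proctable_size = n + 1;
    return 0;
}

/* Phase 2: called once after the loop, tells the debugger to attach. */
void tv_table_publish(void)
{
    MPIR_debug_state = MPIR_DEBUG_SPAWNED;
    MPIR_Breakpoint();
}
```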

> 
> I take it that MPI_Comm_spawn() will not work with totalview?  This
> seems like a fundamental limitation in their interface.  Maybe we
> should just document that in tv_attach.c.
I'm not 100% sure, but I think it could work; it would need rework in mpiexec,
though, because a call to MPI_Comm_spawn() ends in a call to
PMI_Spawn_multiple(), which sends messages to the pmi_socket. Is it working
currently?


Sorry for the late response.


Frank


-- 
Dipl.-Inf. Frank Mietke     |     Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538    |     Fak. für Informatik
Fax:  0371 - 531 8 35538    |     TU-Chemnitz
Key-ID: 60F59599            |     frank.mietke at informatik.tu-chemnitz.de
-------------- next part --------------
diff -Nru mpiexec/Makefile.in mpiexec_tv/Makefile.in
--- mpiexec/Makefile.in	2008-03-31 17:26:25.000000000 +0200
+++ mpiexec_tv/Makefile.in	2008-03-31 17:27:13.000000000 +0200
@@ -7,8 +7,8 @@
 #
 SRC   = mpiexec.c get_hosts.c start_tasks.c task.c event.c util.c config.c \
 	stdio.c growstr.c pmi.c gm.c ib.c psm.c p4.c rai.c concurrent.c \
-	exedist.c spawn.c
-H     = mpiexec.h util.h growstr.h list.h
+	exedist.c spawn.c tv_attach.c
+H     = mpiexec.h util.h growstr.h list.h tv_attach.h
 OTHER = ChangeLog LICENSE README mpiexec.1 proc-relations.fig \
 	hello.c hellof.f hellomp.f redir-helper.c \
 	runtests.pl README.lam
diff -Nru mpiexec/mpiexec.c mpiexec_tv/mpiexec.c
--- mpiexec/mpiexec.c	2008-03-31 17:26:25.000000000 +0200
+++ mpiexec_tv/mpiexec.c	2008-03-31 17:27:13.000000000 +0200
@@ -43,6 +43,7 @@
 int numspawns;
 struct passwd *pswd;
 struct sockaddr_in myaddr;
+char *tvname;
 
 /*
  * Ensure it's executable.  Return true if so.
@@ -507,8 +508,22 @@
 	    if (*cr || l <= 0)
 		error("argument -n requires positive integral number of nodes");
 	    cl_args->numproc = l;
-	} else if (!strcmp(cp, "tv") || !strncmp(cp, "totalview", MAX(2,len)))
+	} else if (!strcmp(cp, "tv") || !strncmp(cp, "totalview", MAX(2,len))) {
+	    char *tvenv;
 	    cl_args->tview = 1;
+	    tvenv = getenv("TOTALVIEW");
+	    if (tvenv != NULL) {
+		if (access(tvenv, X_OK) == 0)
+		    tvname = strdup(tvenv);
+		else {
+		    warning("%s: TOTALVIEW env variable \"%s\" not executable, "
+			    "trying totalview in PATH\n", __func__, tvenv);
+		    tvenv = NULL;
+		}
+	    }
+	    if (tvenv == NULL)
+		tvname = strdup("totalview");
+	}
 	else if (!strncmp(cp, "config", MAX(3,len))) {
 	    cp += MAX(3,len);
 	    cl_args->config_file = find_optarg(cp, &argc, &argv, "config");
diff -Nru mpiexec/mpiexec.h mpiexec_tv/mpiexec.h
--- mpiexec/mpiexec.h	2008-03-31 17:26:25.000000000 +0200
+++ mpiexec_tv/mpiexec.h	2008-03-31 17:27:13.000000000 +0200
@@ -186,6 +186,7 @@
 extern int numspawns;  /* 1 + number of times MPI_Spawn called */
 extern struct passwd *pswd;  /* used for home dir, shell, user name */
 extern struct sockaddr_in myaddr;  /* for out-of-band MPI lib startup */
+extern char *tvname;  /* name of totalview executable, possibly from env var */
 
 /* concurrent.c */
 extern int concurrent_master;  /* if first mpiexec to run in job */
diff -Nru mpiexec/pmi.c mpiexec_tv/pmi.c
--- mpiexec/pmi.c	2008-03-31 17:26:25.000000000 +0200
+++ mpiexec_tv/pmi.c	2008-03-31 17:27:13.000000000 +0200
@@ -263,10 +263,12 @@
 	error_errno("%s: response cmd=set debug=%d", __func__, mpi_task_debug);
     }
 
-    /*
-     * XXX: PMI_TOTALVIEW env var means we must send another little
-     * string; add it sometime.
-     */
+    if (cl_args->tview) {
+       growstr_zero(g);
+       growstr_printf(g, "cmd=tv_ready\n");
+       if (write_full(fd, g->s, g->len) < 0)
+	   error_errno("%s: response cmd=tv_ready", __func__);
+    }
 
     /*
      * PMII_getmaxes
diff -Nru mpiexec/README.tv mpiexec_tv/README.tv
--- mpiexec/README.tv	1970-01-01 01:00:00.000000000 +0100
+++ mpiexec_tv/README.tv	2008-04-02 14:24:43.000000000 +0200
@@ -0,0 +1,56 @@
+mpiexec TotalView support
+
+Supported MPIs:
+
+   MPICH and MPICH2 plus their derivatives.
+
+   Necessary configure options of supported MPIs:
+
+      MPICH   --enable-debug --enable-sharedlib
+      MPICH2  --enable-g=dbg --enable-shared=<kind>  --enable-debuginfo --enable-totalview
+
+
+Environment Variables (recommended):
+
+   TOTALVIEW          - full path to totalview executable
+   TVDSVRLAUNCHCMD    - set to rsh or ssh
+
+   If these are not set, it is assumed that the totalview executable can
+   be found in $PATH and that the launch command for starting the tvdsvr
+   is set to rsh by TotalView.
+
+Further Requirements for MPICH2/PMI support:
+
+   To communicate the process ID back to the mpiexec process properly, either
+   netcat (nc) or a bash built with the configure option --enable-net-redirections
+   (disabled in Debian/Ubuntu by default) must be installed on the nodes.
+   Otherwise the job startup will hang forever.
+
+Usage:
+
+   mpiexec -tv <other options> <executable>
+
+   TotalView starts and the process window becomes visible. With MPICH you have
+   to press the "Go" button first, then you will be asked if you want to stop
+   the job now. Simply answer by pressing "Yes". With MPICH2 you will be asked
+   directly.
+
+Implementation Details of TotalView support in mpiexec:
+
+   MPICH:
+      Everything is handled inside the MPI library. The only job of mpiexec
+      is to invoke totalview on the node where MPI rank 0 is started, with the
+      arguments "<executable> -a -mpichtv" (e.g. totalview ./foobar -a -mpichtv).
+      The remaining processes are started simply by calling the executable with
+      the argument "-mpichtv" (e.g. ./foobar -mpichtv).
+
+   MPICH2:
+      The MPI library is only responsible for waiting in MPI_Init() on a special
+      variable, which the TotalView debugger sets once all processes are attached.
+      Everything else has to be done by the process manager (mpiexec in our case). A
+      TotalView instance is spawned off and attaches to the mpiexec process. After
+      successful startup, mpiexec has to collect the pid, hostname and executable
+      name of every process it starts and put them in a specially named structure
+      which TotalView reads in later.
+
+
diff -Nru mpiexec/start_tasks.c mpiexec_tv/start_tasks.c
--- mpiexec/start_tasks.c	2008-03-31 17:26:25.000000000 +0200
+++ mpiexec_tv/start_tasks.c	2008-04-02 10:18:52.000000000 +0200
@@ -22,6 +22,7 @@
 #include <netdb.h>  /* gethostbyname for portals */
 
 #include "mpiexec.h"
+#include "tv_attach.h"
 
 #ifdef HAVE_PATH_H
 #  include <paths.h>
@@ -296,6 +297,7 @@
     int task_start, task_end;
     const char *mpiexec_redir_helper_path;
     char *psm_uuid = NULL;
+    int tv_port = 0;
 
     /* for looping from 0..numtasks in the case of MPI_Spawn */
     task_start = spawns[spawn].task_start;
@@ -555,6 +557,19 @@
     env_add_if_not("PATH", _PATH_DEFPATH);
     env_add_if_not("USER", pswd->pw_name);
 
+
+    /*
+     * Set up for totalview attach.  Returns local port number that will be
+     * used in command startup to tell processes how to find us.
+     *
+     * XXX: This does not play well with MPI_Comm_spawn.
+     */
+    if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI) {
+	env_add_int("PMI_TOTALVIEW", 1);
+	env_add_int("MPIEXEC_DEBUG", 1);
+	tv_port = tv_startup(task_end - task_start);
+    }
+
     /*
      * Spawn each task, adding its private env vars.
      * numspawned set to zero earlier before signal handler setup;
@@ -637,8 +652,24 @@
 	/* build proc-specific command line */
 	growstr_zero(g);
 	g->translate_single_quote = 0;
-	growstr_printf(g, "if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
-	  pwd, pwd, user_shell);
+
+	/*
+	 * Totalview is a bit odd, even hackish perhaps.  Send the pid of
+	 * the just-starting process to ourselves, either with nc or via
+	 * /dev/tcp, a bash feature that makes a TCP connection as told
+	 * and sends the echoed data.
+	 *
+	 * XXX: this works on what systems exactly?
+	 */
+	if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+	    growstr_printf(g, "if command -v nc >/dev/null 2>&1; then printf %%10d $$ | nc %s %d; else printf %%10d $$ > /dev/tcp/%s/%d; fi; if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
+			   nodes[0].name, tv_port,
+			   nodes[0].name, tv_port,
+			   pwd, pwd, user_shell);
+	else
+	    growstr_printf(g,
+			   "if test -d \"%s\"; then cd \"%s\"; fi; exec %s -c ",
+			   pwd, pwd, user_shell);
 	growstr_append(g, "'exec ");
 	g->translate_single_quote = 1;
 
@@ -654,12 +685,13 @@
 	    growstr_printf(g, "%s ", mpiexec_redir_helper_path);
 
 	/*
-	 * The executable, or a debugger wrapper around it.
+	 * The executable, or a debugger wrapper around it.  In the mpich2
+	 * case we don't need any special args.
 	 */
-	if (cl_args->tview) {
+	if (cl_args->tview && cl_args->comm != COMM_MPICH2_PMI) {
 	    if (i == 0)
-		growstr_printf(g, "totalview %s -a -mpichtv",
-		  tasks[i].conf->exe);
+		growstr_printf(g, "%s %s -a -mpichtv", tvname,
+			       tasks[i].conf->exe);
 	    else
 		growstr_printf(g, "%s -mpichtv", tasks[i].conf->exe);
 	} else
@@ -786,8 +818,13 @@
 		    break;
 	    }
 	}
+	if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+	    tv_accept_one(i);
     }
 
+    if (cl_args->tview && cl_args->comm == COMM_MPICH2_PMI)
+       tv_complete();
+
     /* don't need these anymore */
     free(nargv[0]);
     free(nargv[1]);
diff -Nru mpiexec/tv_attach.c mpiexec_tv/tv_attach.c
--- mpiexec/tv_attach.c	1970-01-01 01:00:00.000000000 +0100
+++ mpiexec_tv/tv_attach.c	2008-04-01 09:51:16.000000000 +0200
@@ -0,0 +1,133 @@
+/*
+ * tv_attach.c - variables and routines for TotalView attachment
+ * 
+ * See: http://www-unix.mcs.anl.gov/mpi/mpi-debug/mpich-attach.txt
+ *
+ * Created: 02/2008 Frank Mietke <frank.mietke at s1998.tu-chemnitz.de>
+ *
+ * Distributed under the GNU Public License Version 2 or later (See LICENSE)
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>  /* nanosleep */
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include "util.h"
+#include "tv_attach.h"
+#include "mpiexec.h"
+
+static int tv_socket;
+
+/*
+ * Totalview will read this structure directly from our address space,
+ * along with some other variables.
+ * 
+ * Changed the strings to const, hoping that it doesn't matter to totalview.
+ */
+typedef struct {
+    const char *host_name;
+    const char *executable_name;
+    int pid;
+} MPIR_PROCDESC;
+
+MPIR_PROCDESC *MPIR_proctable = NULL;
+int MPIR_proctable_size = 0;
+
+volatile int MPIR_debug_state = 0;
+volatile int MPIR_being_debugged = 0;
+int MPIR_i_am_starter = 0;
+int MPIR_partial_attach_ok = 0;
+
+/*
+ * TotalView sets a breakpoint in this function to detect a new event that
+ * the debugger has to handle. Don't forget to set MPIR_debug_state to
+ * something useful before calling this function.
+ */
+void MPIR_Breakpoint(void)
+{
+	struct timespec ts = { 0, 1000 };
+	nanosleep(&ts, NULL);
+}
+
+int tv_startup(int ntasks)
+{
+    int rc;
+    growstr_t *g;
+    struct sockaddr_in tv_sockaddr;
+    socklen_t tv_sockaddr_len = sizeof(tv_sockaddr);
+    struct timespec ts = { 0, 20L * 1000 * 1000 };  /* 20 ms */
+
+    MPIR_proctable = Malloc(ntasks * sizeof(*MPIR_proctable));
+
+    /* Creating socket for pid exchange of remote processes */
+    tv_socket = socket(PF_INET, SOCK_STREAM, 0);
+    if (tv_socket < 0)
+	error_errno("%s: socket ", __func__);
+
+    memset(&tv_sockaddr, 0, sizeof(tv_sockaddr));
+    tv_sockaddr.sin_family = AF_INET;
+    tv_sockaddr.sin_addr = myaddr.sin_addr;
+
+    rc = bind(tv_socket, (struct sockaddr *) &tv_sockaddr, tv_sockaddr_len);
+    if (rc)
+	error_errno("%s: bind", __func__);
+    rc = getsockname(tv_socket, (struct sockaddr *) &tv_sockaddr,
+		     &tv_sockaddr_len);
+    if (rc)
+	error_errno("%s: getsockname", __func__);
+    rc = listen(tv_socket, 32767);
+    if (rc)
+	error_errno("%s: listen", __func__);
+
+    /* start totalview and tell it to attach to this mpiexec process */
+    g = growstr_init();
+    growstr_printf(g, "%s -e \"dattach mpiexec %d; dgo; "
+		      "dassign MPIR_being_debugged 1\" &", tvname, getpid());
+    system(g->s);
+    growstr_free(g);
+
+    /* wait for totalview to find us */
+    while (!MPIR_being_debugged)
+	nanosleep(&ts, NULL);
+
+    return ntohs(tv_sockaddr.sin_port);
+}
+
+/*
+ * This is called once for each task startup.  Wait until the started
+ * task sends its pid, then enter that into the totalview table.  It
+ * might be better to do this in parallel with a service... handler
+ * like some other libraries have, but since this is only for the
+ * debugging case, maybe it is not too terrible to be slow.
+ */
+void tv_accept_one(int n)
+{
+    int pid, ps1;
+    char rpid[11];
+
+    ps1 = accept(tv_socket, 0, 0);
+    if (ps1 < 0)
+	error_errno("%s: accept totalview pid", __func__);
+    read_full(ps1, rpid, 10);
+    close(ps1);
+    rpid[10] = '\0';
+    pid = atoi(rpid);
+    MPIR_proctable[n].host_name = nodes[tasks[n].node].name;
+    MPIR_proctable[n].executable_name = tasks[n].conf->exe;
+    MPIR_proctable[n].pid = pid;
+    MPIR_proctable_size++;
+}
+
+void tv_complete(void)
+{
+    const int MPIR_DEBUG_SPAWNED = 1;
+
+    MPIR_debug_state = MPIR_DEBUG_SPAWNED;
+    MPIR_Breakpoint();
+    close(tv_socket);
+}
+
diff -Nru mpiexec/tv_attach.h mpiexec_tv/tv_attach.h
--- mpiexec/tv_attach.h	1970-01-01 01:00:00.000000000 +0100
+++ mpiexec_tv/tv_attach.h	2008-04-01 09:50:25.000000000 +0200
@@ -0,0 +1,22 @@
+/*
+ * tv_attach.h - totalview PMI attachment header
+ * 
+ * See: http://www-unix.mcs.anl.gov/mpi/mpi-debug/mpich-attach.txt
+ *
+ * Created: 02/2008 Frank Mietke <frank.mietke at s1998.tu-chemnitz.de>
+ *
+ * Distributed under the GNU Public License Version 2 or later (See LICENSE)
+ */
+#ifndef __tv_attach_h
+#define __tv_attach_h
+
+/* Functions to be called by starter */
+int tv_startup(int ntasks);
+void tv_accept_one(int n);
+void tv_complete(void);
+
+/* Called by totalview */
+void MPIR_Breakpoint(void);
+
+#endif
+

