mpiexec -nolocal ignored

Maestas, Christopher Daniel cdmaest at sandia.gov
Fri May 12 17:22:47 EDT 2006


I seem to occasionally see hangs using -nolocal and -nostdout.  This is
with the svn version and the previous patch.
All the MPI processes die, but the mpiexec process on the local node
likes sticking around.
I'm still investigating.  Here's a backtrace:
---
$ gdb_backtrace --match mpiexec
========> PID 28204, CMD /home/cdmaest/tmp/mpiexec -nolocal -nostdout -v -v -v -np 1024 /scratch3/cdmaest/cbench-tests-mvapich-cisco-3.2.0-04062006-devel-intel/mpioverhead/mpi_overhead, STATE S <=============
========> PID 28202, CMD /home/cdmaest/tmp/mpiexec -nolocal -nostdout -v -v -v -np 1024 /scratch3/cdmaest/cbench-tests-mvapich-cisco-3.2.0-04062006-devel-intel/mpioverhead/mpi_overhead, STATE S <=============
$ Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
0x0000003ecb2bc67f in poll () from /lib64/tls/libc.so.6
#0  0x0000003ecb2bc67f in poll () from /lib64/tls/libc.so.6
#1  0x0000000000409eac in stdio_fork (expected_in=0x2a95a90010, abort_fd_in=0x2, pmi_fd_in=4) at stdio.c:1301
#2  0x0000000000405b57 in start_tasks (spawn=0) at start_tasks.c:362
#3  0x0000000000403c6a in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:781
0x0000003ecb2be445 in __select_nocancel () from /lib64/tls/libc.so.6
#0  0x0000003ecb2be445 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000000000410b24 in cm_check_clients () at concurrent.c:1081
#2  0x0000000000406947 in wait_tasks () at task.c:220
#3  0x0000000000403c77 in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:804
--- 


-----Original Message-----
From: Pete Wyckoff [mailto:pw at osc.edu] 
Sent: Friday, May 12, 2006 1:57 PM
To: Maestas, Christopher Daniel
Cc: mpiexec at osc.edu
Subject: Re: mpiexec -nolocal ignored

cdmaest at sandia.gov wrote on Fri, 12 May 2006 09:41 -0600:
> I noted that the -nolocal feature didn't seem to be working.
> I created the following patch but you may be able to come up with 
> something else :-)
> 
> diff -ur mpiexec-0.81/get_hosts.c mpiexec-0.81.patched/get_hosts.c
> --- mpiexec-0.81/get_hosts.c    2006-04-19 21:53:26.000000000 -0600
> +++ mpiexec-0.81.patched/get_hosts.c  2006-05-12 09:38:18.349847000 -0600
> @@ -341,7 +341,10 @@
>  
>      /* enforce one process per physical node by strcmp on host name */
>      if (cl_args->pernode) {
> - for (i=0; i<numnodes; i++) {
> + i=0;
> + if (cl_args->nolocal)
> +  i=1;
> + for (; i<numnodes; i++) {
>       numleft -= (nodes[i].availcpu - 1);
>       nodes[i].availcpu = 1;
>   }

Thanks, that was a fine little bug.  Here's the fix I ended up adding.
There's a new test in runtests.pl in the SVN too.  D'oh.

		-- Pete

Index: get_hosts.c
===================================================================
--- get_hosts.c (revision 369)
+++ get_hosts.c (working copy)
@@ -342,8 +342,10 @@
     /* enforce one process per physical node by strcmp on host name */
     if (cl_args->pernode) {
        for (i=0; i<numnodes; i++) {
-           numleft -= (nodes[i].availcpu - 1);
-           nodes[i].availcpu = 1;
+           if (nodes[i].availcpu > 0) {
+               numleft -= (nodes[i].availcpu - 1);
+               nodes[i].availcpu = 1;
+           }
        }
     }
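
To see why the guard matters, here is a minimal standalone sketch, not
the actual mpiexec source.  It assumes that -nolocal takes the local
node out of play by setting its availcpu to 0 before this loop runs,
which the "availcpu > 0" test suggests but the thread does not state
outright.  The struct layout, the function name clamp_pernode, and the
guard flag are hypothetical and exist only for illustration.

```c
/* Hypothetical stand-in for mpiexec's node record; only availcpu matters. */
struct node {
    const char *name;
    int availcpu;
};

/*
 * Apply the -pernode clamp to every node and return the updated count of
 * remaining cpus.  guard == 0 mimics the loop before the fix, guard != 0
 * mimics the loop after it.
 */
int clamp_pernode(struct node *nodes, int numnodes, int numleft, int guard)
{
    for (int i = 0; i < numnodes; i++) {
        /* After the fix, nodes with no available cpus are left alone. */
        if (!guard || nodes[i].availcpu > 0) {
            /* With availcpu == 0 this subtracts -1, i.e. adds a cpu back. */
            numleft -= (nodes[i].availcpu - 1);
            nodes[i].availcpu = 1;
        }
    }
    return numleft;
}
```

With availcpu values of 0, 2, 2 and numleft of 4, the unguarded loop
sets the first node back to 1 and returns 3, so the excluded node
becomes usable again.  The guarded loop leaves it at 0 and returns 2.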
 




