mpiexec -nolocal ignored
Maestas, Christopher Daniel
cdmaest at sandia.gov
Fri May 12 17:22:47 EDT 2006
I seem to occasionally see hangs using -nolocal and -nostdout. This is
with the svn version and the previous patch.
All the MPI processes die, but the mpiexec process on the local node
likes to stick around.
I'm still investigating. Here's a backtrace:
---
$ gdb_backtrace --match mpiexec
========> PID 28204, CMD /home/cdmaest/tmp/mpiexec -nolocal -nostdout -v
-v -v -np 1024
/scratch3/cdmaest/cbench-tests-mvapich-cisco-3.2.0-04062006-devel-intel/
mpioverhead/mpi_overhead, STATE S <=============
========> PID 28202, CMD /home/cdmaest/tmp/mpiexec -nolocal -nostdout -v
-v -v -np 1024
/scratch3/cdmaest/cbench-tests-mvapich-cisco-3.2.0-04062006-devel-intel/
mpioverhead/mpi_overhead, STATE S <=============
$ Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
0x0000003ecb2bc67f in poll () from /lib64/tls/libc.so.6
#0 0x0000003ecb2bc67f in poll () from /lib64/tls/libc.so.6
#1 0x0000000000409eac in stdio_fork (expected_in=0x2a95a90010, abort_fd_in=0x2, pmi_fd_in=4) at stdio.c:1301
#2 0x0000000000405b57 in start_tasks (spawn=0) at start_tasks.c:362
#3 0x0000000000403c6a in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:781
0x0000003ecb2be445 in __select_nocancel () from /lib64/tls/libc.so.6
#0 0x0000003ecb2be445 in __select_nocancel () from /lib64/tls/libc.so.6
#1 0x0000000000410b24 in cm_check_clients () at concurrent.c:1081
#2 0x0000000000406947 in wait_tasks () at task.c:220
#3 0x0000000000403c77 in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:804
---
-----Original Message-----
From: Pete Wyckoff [mailto:pw at osc.edu]
Sent: Friday, May 12, 2006 1:57 PM
To: Maestas, Christopher Daniel
Cc: mpiexec at osc.edu
Subject: Re: mpiexec -nolocal ignored
cdmaest at sandia.gov wrote on Fri, 12 May 2006 09:41 -0600:
> I noted that the -nolocal feature didn't seem to be working.
> I created the following patch but you may be able to come up with
> something else :-)
>
> diff -ur mpiexec-0.81/get_hosts.c mpiexec-0.81.patched/get_hosts.c
> --- mpiexec-0.81/get_hosts.c 2006-04-19 21:53:26.000000000 -0600
> +++ mpiexec-0.81.patched/get_hosts.c 2006-05-12 09:38:18.349847000 -0600
> @@ -341,7 +341,10 @@
>
> /* enforce one process per physical node by strcmp on host name */
> if (cl_args->pernode) {
> - for (i=0; i<numnodes; i++) {
> + i=0;
> + if (cl_args->nolocal)
> + i=1;
> + for (; i<numnodes; i++) {
> numleft -= (nodes[i].availcpu - 1);
> nodes[i].availcpu = 1;
> }
Thanks, that was a fine little bug. Here's the fix I ended up adding.
There's a new test in runtests.pl in the SVN too. D'oh.
-- Pete
Index: get_hosts.c
===================================================================
--- get_hosts.c (revision 369)
+++ get_hosts.c (working copy)
@@ -342,8 +342,10 @@
/* enforce one process per physical node by strcmp on host name */
if (cl_args->pernode) {
for (i=0; i<numnodes; i++) {
- numleft -= (nodes[i].availcpu - 1);
- nodes[i].availcpu = 1;
+ if (nodes[i].availcpu > 0) {
+ numleft -= (nodes[i].availcpu - 1);
+ nodes[i].availcpu = 1;
+ }
}
}
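
For anyone reading along, here is a minimal standalone sketch of what the
guarded loop does. It assumes that a node excluded by -nolocal is marked
with availcpu == 0, which is what the new check suggests but is not spelled
out in this thread. The struct and the function name enforce_pernode are
hypothetical stand-ins, not the actual mpiexec definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mpiexec's node record; only the field the
 * patch touches is modeled here. */
struct node {
    int availcpu;   /* usable CPUs; 0 is ASSUMED to mean "excluded" */
};

/* Clamp every usable node to one process and return the adjusted count
 * of remaining process slots.  Nodes with availcpu == 0 are left alone,
 * mirroring the guard added in the fix. */
int enforce_pernode(struct node *nodes, int numnodes, int numleft)
{
    int i;

    assert(nodes != NULL || numnodes == 0);
    for (i = 0; i < numnodes; i++) {
        if (nodes[i].availcpu > 0) {
            numleft -= (nodes[i].availcpu - 1);
            nodes[i].availcpu = 1;
        }
    }
    return numleft;
}
```

Without the guard, a node at availcpu == 0 would be set back to 1 and
numleft would grow by one, so the excluded node would become schedulable
again. Unlike starting the loop at i=1, the guard does not depend on the
local node being first in the list.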