mpiexec -nolocal ignored
Pete Wyckoff
pw at osc.edu
Mon May 15 12:52:18 EDT 2006
cdmaest at sandia.gov wrote on Fri, 12 May 2006 15:22 -0600:
> I seem to occasionally see hangs using -nolocal and -nostdout. This is
> with the svn version and the previous patch.
> All the MPI processes die, but the mpiexec process on the local node
> likes sticking around.
[..]
> #0 0x0000003ecb2bc67f in poll () from /lib64/tls/libc.so.6
> #1 0x0000000000409eac in stdio_fork (expected_in=0x2a95a90010,
> abort_fd_in=0x2, pmi_fd_in=4) at stdio.c:1301
> #2 0x0000000000405b57 in start_tasks (spawn=0) at start_tasks.c:362
> #3 0x0000000000403c6a in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:781
The stdio listener thread is waiting for something to happen on
stdin, stdout or stderr. Normal.
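
As a rough illustration of why a backtrace ending in poll() is unremarkable, here is a minimal Python sketch of a listener that blocks until one of its descriptors has data. It is not mpiexec's stdio.c; the helper name wait_for_activity and the pipe used in the demo are made up for the example, and select.poll is only available on Unix-like systems.

```python
import os
import select


def wait_for_activity(fds, timeout_ms):
    """Block in poll() until one of fds is readable or the timeout expires.

    Returns the list of descriptors that became readable (empty on timeout).
    Hypothetical helper for illustration only.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    return [fd for fd, _event in poller.poll(timeout_ms)]


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    # Nothing has been written yet, so poll() just sits until the timeout.
    print("idle:", wait_for_activity([read_end], 100))
    os.write(write_end, b"x")
    # Now there is data, so poll() returns right away with the descriptor.
    print("ready:", wait_for_activity([read_end], 100))
    os.close(read_end)
    os.close(write_end)
```

A listener with no timeout would stay in poll() for as long as the job produces no I/O, which is what the first backtrace shows.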
> #0 0x0000003ecb2be445 in __select_nocancel () from /lib64/tls/libc.so.6
> #1 0x0000000000410b24 in cm_check_clients () at concurrent.c:1081
> #2 0x0000000000406947 in wait_tasks () at task.c:220
> #3 0x0000000000403c77 in main (argc=1, argv=0x7fbfffeaf8) at mpiexec.c:804
The main process is waiting on some tasks to finish. Maybe they
didn't all die, or the PBS mom didn't tell us about all the ones
that died. With the -verbose flag you had on, these lines should
appear:

    mpiexec: wait_tasks: waiting for ...

showing you what mpiexec thinks is still running. You can compare
against the mom log file on each compute node to see if there is
disagreement somewhere. Each task should produce a message there at
startup and again at termination.
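
If the verbose output was saved to a file, something like the following Python sketch could pull out the relevant lines before checking them against the mom logs. It assumes only that each line of interest contains the "mpiexec: wait_tasks: waiting for" text quoted above; the function names and the log file path are hypothetical, and no attempt is made to parse whatever follows the prefix.

```python
import sys

# Message text taken from the verbose output described above.
PREFIX = "mpiexec: wait_tasks: waiting for"


def waiting_lines(lines):
    """Return every line that reports tasks mpiexec is still waiting on."""
    return [ln.rstrip("\n") for ln in lines if PREFIX in ln]


def last_waiting_line(lines):
    """Return the most recent such line, or None if there is none."""
    hits = waiting_lines(lines)
    return hits[-1] if hits else None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (file name is an assumption): python waiting.py mpiexec-verbose.log
    with open(sys.argv[1]) as log:
        print(last_waiting_line(log))
```

The last matching line is the one to compare against the mom logs, since it reflects what mpiexec still believed was running when it hung.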
Let me know if you find something awry.
-- Pete