Zombie mpiexec processes
Pete Wyckoff
pw at osc.edu
Thu Feb 2 11:45:18 EST 2006
martin.schaffoener at e-technik.uni-magdeburg.de wrote:
> So, we've made our test fail, with PID 10816 on MS being the zombie mpiexec
> process. The logfile around "10816" says:
>
> kill_stdio: sent SIGTERM, waiting on 10816
This line is logged right before mpiexec goes into waitpid(). It will
patiently wait until stdio exits. Meanwhile stdio waits, for up to
five seconds after the SIGTERM is received, for connections to close,
and then exits. With lots of "-v -v -v", stdio should print
a couple of lines about what it's doing.
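The SIGTERM-then-waitpid() sequence described above can be sketched roughly as follows. This is only an illustration in Python, not mpiexec's actual code: spawn_helper, terminate_and_reap and the short grace period are hypothetical stand-ins for the stdio process and its up-to-five-second wait.

```python
import os
import signal
import time


def spawn_helper(grace_seconds=0.2):
    """Fork a hypothetical helper that lingers briefly after SIGTERM, then exits 0."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the "stdio" process.
        try:
            os.close(ready_r)

            def on_term(signum, frame):
                # Stand-in for "waiting for connections to close".
                time.sleep(grace_seconds)
                os._exit(0)

            signal.signal(signal.SIGTERM, on_term)
            os.write(ready_w, b"x")  # tell the parent the handler is installed
            while True:
                signal.pause()
        finally:
            os._exit(1)
    # Parent: wait until the child has installed its handler.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    return pid


def terminate_and_reap(pid):
    """Send SIGTERM, then block in waitpid() until the child has exited."""
    os.kill(pid, signal.SIGTERM)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    status = terminate_and_reap(spawn_helper())
    print("helper exited normally:", os.WIFEXITED(status))
```

The parent blocks in waitpid() for as long as the helper takes to shut down, which is why a helper that never exits would leave the parent waiting indefinitely.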
> Don't know what other parts of the logfile would be interesting, it's about
> 29M uncompressed.
I'll be happy to look through it if you want to post it (ftp/http)
somewhere.
> I take back my last comment and testify the opposite. "-nostdin -nostdout"
> does _not_ fix the problem. We again have an mpiexec process dangling around,
> but not in zombie state. Even though the respective client command is
> finished, the mpiexec process on MS is not.
I've been playing around with this trying to figure out how you get
these stranded zombies. Because mpiexec uses a second process to
handle stdio, I first feared that something was going awry with that
termination process, but I couldn't convince myself there was
anything wrong there. Some details:
    start: mpiexec forks, creating process I'll call "stdio"

    if stdio exits normally because all connections to tasks are
    closed, it waits in zombie state until mpiexec waits for
    it or exits. This is normal.

    if stdio dies unnaturally either
        mpiexec is also exiting and does the waitpid() to reap it, or
        mpiexec dies, stdio reparented to init, init reaps it

    if mpiexec dies unnaturally (via kill -9 or segv)
        stdio continues to interact with processes
        stdio exits if/when all processes disconnect
        stdio is reaped by init
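The "normal" zombie case in the list above can be reproduced with a small sketch. It assumes a POSIX system and is not taken from mpiexec; pid_exists and demo_zombie are hypothetical names. A child that has exited still has a process-table entry until its parent calls waitpid(), so a signal-0 probe with kill() keeps succeeding until the reap.

```python
import os
import time


def pid_exists(pid):
    """Return True if the process-table entry for pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    return True


def demo_zombie():
    """Fork a child that exits at once; probe it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    time.sleep(0.2)  # give the child time to exit; it is not reaped yet
    before = pid_exists(pid)  # entry still present (zombie)
    _, status = os.waitpid(pid, 0)  # reap the child
    after = pid_exists(pid)  # entry is gone once reaped
    return before, after, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(demo_zombie())
```

Between the child's exit and the waitpid() call, "ps" would show the child in state Z, which is the harmless window described above.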
> A master process spawns multiple "mpiexec -n 1 -comm none ..."
> processes and waits for them to finish.
So now I'm suspecting it has something to do with your process that
is calling system() or fork/exec to start these individual
one-process mpiexecs. Using "-nostdin -nostdout" means that no
stdio process will be forked, and your observation that there are
still zombie processes points back to your top-level spawner. You
can do "ps xwl" or equivalent to verify that these zombies have PPID
of your spawner process.
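For comparison, here is a minimal sketch of a spawner that reaps every child it starts. It is an assumption about what such a master process might look like, not your actual code: run_all is a hypothetical name, and the Python one-liners stand in for the real "mpiexec -n 1 -comm none ..." command lines.

```python
import subprocess
import sys


def run_all(commands):
    """Start every command, then wait on each so none is left as a zombie."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    # wait() reaps the child and returns its exit code.
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Stand-in commands; a real spawner would list its mpiexec invocations here.
    cmds = [[sys.executable, "-c", "raise SystemExit(%d)" % i] for i in range(3)]
    print(run_all(cmds))
```

If the spawner starts children but skips the wait for some of them, those children stay in the process table as zombies with the spawner's PID as their PPID until the spawner itself exits.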
Does this help narrow it down?
-- Pete