Zombie mpiexec processes

Pete Wyckoff pw at osc.edu
Tue Feb 7 10:16:24 EST 2006


pw at osc.edu wrote on Thu, 02 Feb 2006 11:45 -0500:
> martin.schaffoener at e-technik.uni-magdeburg.de wrote:
> > So, we've made our test fail, with PID 10816 on MS being the zombie mpiexec
> > process. The logfile around "10816" says:
> > 
> > kill_stdio: sent SIGTERM, waiting on 10816
> 
> This line happens right before mpiexec goes into waitpid().  It will
> patiently wait until stdio exits.  Meanwhile stdio is, for up to
> five seconds after the SIGTERM is received, waiting for connections
> to close then exiting.  With lots of "-v -v -v" stdio should print
> a couple lines about what it's doing.
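
For anyone following along, the sequence described in that quoted paragraph looks roughly like the sketch below. It is only an illustration, not the mpiexec source: the names terminate_and_wait and spawn_helper are made up, a sleeping child stands in for the stdio helper, and the up-to-five-second drain of open connections is not modelled.

```python
import signal
import subprocess
import sys


def spawn_helper():
    """Hypothetical stand-in for the stdio helper: a child that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )


def terminate_and_wait(proc):
    """Send SIGTERM, then block until the child has exited.

    There is deliberately no timeout here: the parent waits for as long
    as the child takes to go away, which is the behaviour described above.
    """
    proc.send_signal(signal.SIGTERM)
    return proc.wait()  # on POSIX this blocks in waitpid()


if __name__ == "__main__":
    helper = spawn_helper()
    status = terminate_and_wait(helper)
    # A negative return code means the child was killed by that signal.
    print("helper exit status:", status)
```

If the child never exits, the parent never returns from the wait, which is why a missing termination event leaves a process hanging around.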

For what it's worth, I agree completely with your diagnosis of the
problem that you posted on the torque list, and your suggested fix.

The log sample you sent me had 59 cases where a later process had
the same pid as an earlier one.  For some reason, PBS remembers all
tasks ever spawned, even after they've exited.  When one of these
repeated-pid processes exited, it was matched by pid to an earlier
task ID and PBS did not report a termination event to mpiexec.
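
To make the failure mode concrete, here is a small simulation of matching exit events by pid against a task list that is never pruned. Everything in it is hypothetical (TaskTable, reap_naive, reap_live_only are illustrative names, not PBS or torque code), and the "live only" lookup is just one plausible remedy, not necessarily the fix proposed on the torque list.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    pid: int
    exited: bool = False


class TaskTable:
    """Hypothetical per-job task list that keeps exited tasks forever."""

    def __init__(self):
        self.tasks = []
        self._next_id = 1

    def spawn(self, pid):
        """Record a newly spawned task and return it."""
        task = Task(self._next_id, pid)
        self._next_id += 1
        self.tasks.append(task)
        return task

    def reap_naive(self, pid):
        """Match an exit by pid alone, oldest task first.

        Returns the task ID to report, or None if no event is sent.
        """
        for task in self.tasks:
            if task.pid == pid:
                if task.exited:
                    # Stale entry with a reused pid: the exit looks like
                    # a duplicate, so the termination event is lost.
                    return None
                task.exited = True
                return task.task_id
        return None

    def reap_live_only(self, pid):
        """Match an exit by pid, ignoring tasks already marked exited."""
        for task in self.tasks:
            if task.pid == pid and not task.exited:
                task.exited = True
                return task.task_id
        return None


if __name__ == "__main__":
    table = TaskTable()
    table.spawn(10816)
    print(table.reap_naive(10816))      # first task exits and is reported
    table.spawn(10816)                  # pid counter wrapped, same pid again
    print(table.reap_naive(10816))      # matched to the stale entry
    print(table.reap_live_only(10816))  # the live task is found instead
```

In the naive lookup the second process with pid 10816 is matched to the first, already-exited task, so nothing is reported for it; that is the situation in which mpiexec would never hear about the termination.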

There's really no way we can work around the pid-wrap problem in
mpiexec barring some really icky painful hacks.  Hopefully the
torque developers will fix this soon.  Fortunately your usage
pattern is a bit heavier than most, and no one else has ever run
into this problem before.

		-- Pete

