Zombie mpiexec processes
Pete Wyckoff
pw at osc.edu
Tue Feb 7 10:16:24 EST 2006
pw at osc.edu wrote on Thu, 02 Feb 2006 11:45 -0500:
> martin.schaffoener at e-technik.uni-magdeburg.de wrote:
> > So, we've made our test fail, with PID 10816 on MS being the zombie mpiexec
> > process. The logfile around "10816" says:
> >
> > kill_stdio: sent SIGTERM, waiting on 10816
>
> This line happens right before mpiexec goes into waitpid(). It will
> patiently wait until stdio exits. Meanwhile stdio is, for up to
> five seconds after the SIGTERM is received, waiting for connections
> to close then exiting. With lots of "-v -v -v" stdio should print
> a couple lines about what it's doing.
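
The kill-then-wait sequence described in the quoted text can be sketched
as follows. This is only an illustration in Python, not mpiexec's actual
code: child_main, kill_and_wait, and the shortened LINGER_SECONDS are
hypothetical names and values chosen for the example.

```python
import os
import signal
import time

# Stand-in for the "up to five seconds" linger described above (assumption).
LINGER_SECONDS = 0.2


def child_main(ready_fd):
    """Hypothetical stdio-like helper: on SIGTERM, linger briefly, then exit."""
    def on_term(signum, frame):
        time.sleep(LINGER_SECONDS)  # pretend to wait for connections to close
        os._exit(0)

    signal.signal(signal.SIGTERM, on_term)
    os.write(ready_fd, b"x")  # tell the parent the handler is installed
    while True:
        signal.pause()


def kill_and_wait():
    """Send SIGTERM to the helper, then block in waitpid() until it exits."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        child_main(w)  # never returns
    os.close(w)
    os.read(r, 1)  # wait until the child has installed its handler
    os.close(r)
    start = time.monotonic()
    os.kill(pid, signal.SIGTERM)
    # Blocks here; if the child never exited, the parent would wait forever.
    reaped, status = os.waitpid(pid, 0)
    elapsed = time.monotonic() - start
    return reaped == pid, os.WEXITSTATUS(status), elapsed
```

The parent has no timeout of its own: it relies entirely on the child
exiting, which is why it waits "patiently" in waitpid().
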
For what it's worth, I agree completely with the diagnosis you
posted on the torque list, and with your suggested fix.
The log sample you sent me had 59 cases where a later process had
the same pid as an earlier one. For some reason, PBS remembers all
tasks ever spawned, even after they've exited. When one of these
repeated-pid processes exited, it was matched by pid to an earlier
task ID, so PBS did not report a termination event to mpiexec.
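
To make the pid-matching failure concrete, here is a toy model in Python.
It is not Torque's code: TaskTable, reap_buggy, and reap_fixed are
hypothetical names, and skipping exited tasks is just one plausible way
to avoid the mismatch, not necessarily the fix proposed on the torque list.

```python
class TaskTable:
    """Hypothetical model of a task list that never forgets exited tasks."""

    def __init__(self):
        self.tasks = []   # every task ever spawned, oldest first
        self.next_id = 1

    def spawn(self, pid):
        """Record a new task for this pid and return its task ID."""
        task = {"task_id": self.next_id, "pid": pid, "exited": False}
        self.next_id += 1
        self.tasks.append(task)
        return task["task_id"]

    def reap_buggy(self, pid):
        """Match the first task with this pid, whether or not it is alive."""
        for task in self.tasks:
            if task["pid"] == pid:
                if task["exited"]:
                    return None  # stale match: no termination event is sent
                task["exited"] = True
                return task["task_id"]
        return None

    def reap_fixed(self, pid):
        """Match only tasks that have not already exited."""
        for task in self.tasks:
            if task["pid"] == pid and not task["exited"]:
                task["exited"] = True
                return task["task_id"]
        return None


def demo(reap_name):
    """Spawn, reap, reuse the same pid, and reap again; return both results."""
    table = TaskTable()
    reap = getattr(table, reap_name)
    table.spawn(10816)
    first = reap(10816)
    table.spawn(10816)   # the pid has wrapped and is reused by a new task
    second = reap(10816)
    return first, second
```

With reap_buggy, the second exit is matched to the first (already exited)
task and no event comes back, which is the situation that leaves mpiexec
waiting. With reap_fixed, the second exit is reported against the new task.
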
There's really no way we can work around the pid-wrap problem in
mpiexec barring some really icky painful hacks. Hopefully the
torque developers will fix this soon. Fortunately your usage
pattern is a bit heavier than most, and no one else has ever run
into this problem before.
-- Pete