Zombie mpiexec processes

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Tue Feb 7 10:32:38 EST 2006


On Tuesday 07 February 2006 16:16, Pete Wyckoff wrote:

> For what it's worth, I agree completely with your diagnosis of the
> problem that you posted on the torque list, and your suggested fix.

Unfortunately, I haven't yet got a reply from the torque guys.

> The log sample you sent me had 59 cases where a later process had
> the same pid as an earlier one.  For some reason, PBS remembers all

Hm. Those later processes must either not have been tracked (i.e. spawned
directly) by PBS, or their PIDs must have come from other nodes, because the
mpiexec-PBS combo definitely gets stuck on the first task that reuses the PID
of an earlier task. I verified this by executing a few hundred mpiexec calls
and then fast-forwarding the PIDs with some 30000 calls to /bin/true.
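
For anyone who wants to repeat the fast-forwarding step, here is a rough
sketch of it in Python. It is not the script I actually used; the names
burn_pids and wrapped are made up for illustration, and the wrap check
assumes the kernel hands out PIDs sequentially up to a limit (on Linux,
/proc/sys/kernel/pid_max, traditionally 32768) before starting over.

```python
import subprocess


def burn_pids(count, command=("/bin/true",)):
    """Run `command` `count` times, one after another, and return the child PIDs."""
    pids = []
    for _ in range(count):
        child = subprocess.Popen(list(command))
        pids.append(child.pid)
        child.wait()  # reap the child so the kernel can hand its PID out again
    return pids


def wrapped(pids):
    """True if some PID is lower than the one before it.

    Assumption: PIDs are allocated sequentially, so a drop means the
    counter has wrapped around.
    """
    return any(later < earlier for earlier, later in zip(pids, pids[1:]))
```

Calling wrapped(burn_pids(30000)) between two batches of mpiexec runs should
report whether the PID counter has gone around; how many calls are needed
depends on the pid_max setting and on what else is running on the node.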

> tasks ever spawned, even after they've exited.  When one of these
> repeated-pid processes exited, it was matched by pid to an earlier
> task ID and PBS did not report a termination event to mpiexec.
>
> There's really no way we can work around the pid-wrap problem in
> mpiexec barring some really icky painful hacks.  Hopefully the

Yeah, it really isn't an mpiexec bug.
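
To make the failure mode concrete, here is a toy model in Python of the
behaviour described above. It is not torque code: TaskTable, spawn,
on_child_exit and the skip_exited switch are all hypothetical names, and
skipping exited entries is just one plausible fix, not necessarily the one
posted to the torque list.

```python
class TaskTable:
    """Toy model of a task list that never forgets exited tasks."""

    def __init__(self, skip_exited):
        self.tasks = []               # every task ever spawned, oldest first
        self.skip_exited = skip_exited
        self.next_task_id = 1

    def spawn(self, pid):
        """Record a new task running as `pid` and return its task ID."""
        task = {"task_id": self.next_task_id, "pid": pid, "exited": False}
        self.next_task_id += 1
        self.tasks.append(task)
        return task["task_id"]

    def on_child_exit(self, pid):
        """Return the task ID whose termination is reported, or None if no event is sent."""
        for task in self.tasks:
            if task["pid"] != pid:
                continue
            if task["exited"]:
                if self.skip_exited:
                    continue          # fixed: keep looking for a live task
                return None           # buggy: stale match swallows the event
            task["exited"] = True
            return task["task_id"]
        return None


buggy = TaskTable(skip_exited=False)
buggy.spawn(4242)
buggy.on_child_exit(4242)             # first task: termination is reported
buggy.spawn(4242)                     # PID reused after a wrap
lost = buggy.on_child_exit(4242)      # None: the waiter never hears about it
```

With skip_exited=True the second exit is matched to the second task, which
is what the client waiting on that task needs to see.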

> torque developers will fix this soon.  Fortunately your usage
> pattern is a bit heavier than most, and no one else has ever run
> into this problem before.

Actually, I'm quite surprised about it considering how long PBS/torque has 
been around.

CU,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

