Zombie mpiexec processes

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Fri Feb 3 07:51:05 EST 2006


On Thursday 02 February 2006 17:45, Pete Wyckoff wrote:
> I've been playing around with this trying to figure out how you get
> these stranded zombies.  Because mpiexec uses a second process to
> handle stdio, I first feared that something was going awry with that
> termination process, but I couldn't convince myself there was

So it seems the fault lies neither with mpiexec nor with our spawner, but 
rather with PBS. The spawner spawns an mpiexec, which in turn spawns 
mpiexec-stdio. The command gets executed on the remote node (which in this 
case was the local node) and finishes correctly. mpiexec-stdio finds out (or 
is told?) and exits, becoming a zombie that waits for the "real" mpiexec to 
reap it. But this real mpiexec never exits. This is what I found in the PBS 
mom logfile:

02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3393, sid 30407, cmd /bin/sh
02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3395, sid 30410, cmd /bin/sh
02/02/2006 23:02:11;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
02/02/2006 23:02:17;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 3393 terminated, sid 30407

Notice that it starts TID 3395, which is never reported as terminated. 
Instead, it reports TID 562 as terminated. But TID 562 had already terminated 
long before:

02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 562, sid 30410, cmd /bin/sh
02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 563, sid 30411, cmd /bin/sh
02/02/2006 15:33:33;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
02/02/2006 15:33:38;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 563 terminated, sid 30411

The striking thing is that both TID 562 and TID 3395 have SID 30410. I don't 
know whether this has any influence.
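
For anyone who wants to check their own mom logs for the same pattern, here is 
a minimal sketch of how the pairing could be done. It is only an illustration, 
not a Torque tool: the function name audit_tasks and the two regular 
expressions are my own assumptions, based solely on the two message shapes 
quoted above, and other Torque versions may log differently.

```python
import re
from collections import defaultdict

# Patterns for the two pbs_mom messages quoted in this mail (an assumption:
# other Torque versions may word these messages differently).
START_RE = re.compile(r"start_process: task started, tid (\d+), sid (\d+)")
TERM_RE = re.compile(
    r"scan_for_terminated: job \S+ task (\d+) terminated, sid (\d+)")

# The eight log entries from this mail, in chronological order.
SAMPLE_LOG = """\
02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 562, sid 30410, cmd /bin/sh
02/02/2006 15:32:12;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 563, sid 30411, cmd /bin/sh
02/02/2006 15:33:33;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
02/02/2006 15:33:38;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 563 terminated, sid 30411
02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3393, sid 30407, cmd /bin/sh
02/02/2006 23:00:41;0008;   pbs_mom;Job;6222.master;start_process: task started, tid 3395, sid 30410, cmd /bin/sh
02/02/2006 23:02:11;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 562 terminated, sid 30410
02/02/2006 23:02:17;0080;   pbs_mom;Job;6222.master;scan_for_terminated: job 6222.master task 3393 terminated, sid 30407
"""


def audit_tasks(log_text):
    """Pair task starts with terminations, in log order.

    Returns (unreaped, unmatched, shared_sids):
      unreaped    - {tid: sid} for tasks started but never reported terminated
      unmatched   - [(tid, sid)] terminations for tasks not currently running
      shared_sids - {sid: [tids]} for session IDs used by more than one task
    """
    # Collapse whitespace so entries wrapped by a mail client still match.
    text = " ".join(log_text.split())

    events = []
    for m in START_RE.finditer(text):
        events.append((m.start(), "start", int(m.group(1)), int(m.group(2))))
    for m in TERM_RE.finditer(text):
        events.append((m.start(), "term", int(m.group(1)), int(m.group(2))))
    events.sort()  # restore the order in which the entries appear in the log

    running = {}
    unmatched = []
    sid_to_tids = defaultdict(list)
    for _, kind, tid, sid in events:
        if kind == "start":
            running[tid] = sid
            sid_to_tids[sid].append(tid)
        elif tid in running:
            del running[tid]
        else:
            unmatched.append((tid, sid))

    shared_sids = {s: t for s, t in sid_to_tids.items() if len(t) > 1}
    return running, unmatched, shared_sids


unreaped, unmatched, shared_sids = audit_tasks(SAMPLE_LOG)
print("started, never terminated:", unreaped)
print("terminated, but not running:", unmatched)
print("SIDs shared by several tasks:", shared_sids)
```

On the entries above, this flags TID 3395 as started but never terminated, the 
second termination of TID 562 as unmatched, and SID 30410 as shared by TIDs 562 
and 3395.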

All this happened with Torque 2.0.0p5. We are currently trying the latest 
2.0.0p7 to see if this fixes the problem. If not, I guess it's a Torque bug, 
right?

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

