Zombie mpiexec processes
Pete Wyckoff
pw at osc.edu
Thu Jan 19 10:57:15 EST 2006
martin.schaffoener at e-technik.uni-magdeburg.de wrote on Thu, 19 Jan 2006 12:54 +0100:
> We have a scenario where a list of files is processed in parallel on several
> nodes. A master process spawns multiple "mpiexec -n 1 -comm none ..."
> processes and waits for them to finish, after which new mpiexec processes can
> be spawned until the list is finished. Unfortunately, some of these mpiexec
> child processes go zombie, thus never return, and the job does not get
> finished.
>
> Does anybody have any idea why these processes might go zombie? We use mpiexec
> 0.80 linked against torque 2.0.0p5.

Do you mean you're using "mpiexec -server", then starting up lots of
"mpiexec -n 1" in parallel from some sort of script?

Internally, each of these "mpiexec -n 1" forks off a stdio listener
that should exit when the main process exits. Perhaps this isn't
happening for some reason, and your script ends up owning the
orphaned stdio listener process?
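
For reference, a child that has exited stays a zombie until its parent
waits on it, so the master has to reap every child it starts. Below is a
minimal sketch of such a spawn-and-reap loop. It is not the poster's
actual master process: run_pool, max_parallel and poll_interval are
hypothetical names, and the loop only reaps direct children, so it would
not by itself catch a leftover stdio listener.

```python
import subprocess
import sys
import time


def run_pool(commands, max_parallel=4, poll_interval=0.05):
    """Run each command with at most max_parallel children alive at once.

    Returns a list of (command, exit_code) pairs in completion order.
    run_pool is a hypothetical helper, not part of mpiexec or Torque.
    """
    pending = list(commands)
    running = []   # (command, Popen) pairs that have not been reaped yet
    results = []

    while pending or running:
        # Start new children while there are free slots.
        while pending and len(running) < max_parallel:
            cmd = pending.pop(0)
            running.append((cmd, subprocess.Popen(cmd)))

        # poll() does a non-blocking wait: if the child has exited, it is
        # reaped here and therefore cannot linger as a zombie.
        still_running = []
        for cmd, proc in running:
            exit_code = proc.poll()
            if exit_code is None:
                still_running.append((cmd, proc))
            else:
                results.append((cmd, exit_code))
        running = still_running

        if running:
            time.sleep(poll_interval)

    return results


if __name__ == "__main__":
    # Stand-in commands; a real master would pass mpiexec command lines.
    demo = [[sys.executable, "-c", "import sys; sys.exit(0)"] for _ in range(3)]
    for cmd, code in run_pool(demo, max_parallel=2):
        print(code, cmd)
```

If children started this way still show up as zombies, the parent is
probably not the process that owns them, which would point back at the
stdio listener.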

If it's repeatable you might try a small test with "-v -v -v" in the
"mpiexec -n 1" to see what, if anything, is going awry. Another
approach would be to use "-nostdin -nostdout" to prevent the fork of
the stdio listener and see if that fixes it, narrowing the problem a
bit.
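
The two diagnostic variants above can be generated from a script. This
is only a sketch: it uses the options named in this thread ("-n 1",
"-comm none", "-v", "-nostdin", "-nostdout"), while build_mpiexec_cmd
and the "./worker" program are hypothetical placeholders for whatever
the master process really runs.

```python
def build_mpiexec_cmd(program_args, verbosity=0, no_stdio=False):
    """Build an 'mpiexec -n 1 -comm none ...' command line as a list.

    program_args: the program and its arguments (hypothetical here).
    verbosity:    how many times to repeat "-v".
    no_stdio:     add "-nostdin -nostdout" so no stdio listener is forked.
    """
    cmd = ["mpiexec", "-n", "1", "-comm", "none"]
    cmd += ["-v"] * verbosity
    if no_stdio:
        cmd += ["-nostdin", "-nostdout"]
    # mpiexec options go before the program to run.
    return cmd + list(program_args)


if __name__ == "__main__":
    # "./worker" and "file1" are placeholders, not names from the thread.
    print(" ".join(build_mpiexec_cmd(["./worker", "file1"], verbosity=3)))
    print(" ".join(build_mpiexec_cmd(["./worker", "file1"], no_stdio=True)))
```

Running one batch with each variant keeps the two experiments separate:
the verbose run shows what mpiexec is doing, and the no-stdio run tests
whether the listener is involved at all.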

-- Pete