suspend jobs
Pete Wyckoff
pw at osc.edu
Thu Apr 22 13:17:52 EDT 2004
Sebastien.Georget at sophia.inria.fr wrote on Thu, 22 Apr 2004 16:55 +0200:
> I would like to know if there are plans to add suspend support to
> mpiexec. I wrote a small patch (see attachment) which seems to do the
> job. It is mainly cut&paste code but it seems to show that it is
> possible (with simple mpi progs at least). Does anybody else work on
> this subject ?
I do think it's a worthy idea, but I've been hesitant to change
anything due to some confusion on my part about how this should work
in all cases. I'd like to get it right the first time if possible.
Interactive
When the user hits ^Z to suspend a running mpiexec, the current
behavior is that the parallel job keeps running. Do you think
the right thing to do is to send SIGSTOP to the job's processes and
then suspend mpiexec itself? (Then on continue, wake them up too of
course.)
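To make the idea concrete, here is a rough sketch of that behavior in Python. It is not Sébastien's patch and not how mpiexec is written; the function names (stop_children, continue_children, install_handlers) are made up for illustration, and local child pids stand in for the remote tasks, which a real mpiexec would have to signal through whatever remote delivery mechanism it has.

```python
import os
import signal


def stop_children(pids):
    """Send SIGSTOP to every process in pids (stand-ins for job tasks)."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)


def continue_children(pids):
    """Send SIGCONT to every process in pids to wake them up again."""
    for pid in pids:
        os.kill(pid, signal.SIGCONT)


def install_handlers(pids):
    """Hypothetical launcher hook: make ^Z (SIGTSTP) stop the job first,
    then the launcher, and make SIGCONT resume the job as well."""

    def on_tstp(signum, frame):
        stop_children(pids)
        # SIGSTOP cannot be caught, so this really suspends the launcher
        # and lets the shell report it as stopped.
        os.kill(os.getpid(), signal.SIGSTOP)

    def on_cont(signum, frame):
        continue_children(pids)

    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)
```

A launcher would call install_handlers() with the pids it started before waiting on them. The launcher stops itself with SIGSTOP rather than re-raising SIGTSTP only to keep the sketch short; either choice is an assumption on my part.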
Batch
In batch use it's quite a different scenario. Here I imagine that a
scheduler or user might want to suspend a job or send it a HUP to
reread configuration files or some such. With PBS the command
"qsig" sends an arbitrary signal to all tasks in the job session on
the first node. Sending a SIGSTOP will stop the shell running your
script, the mpiexec, and the processes of the parallel job that
happen to be running on node #0. But it doesn't send the signal to
any other nodes.
This does not seem like the right behavior to me. I think one
generally wants to do something to all the processes. That
conclusion implies the same answer in the interactive case, for
symmetry.
Implementing this properly requires fixing PBS. It would be a hack
for mpiexec to catch the signal and send it to the processes running
on other nodes. Plus some people may use PBS without running
mpiexec, believe it or not.
Any opinions welcome. I'm sort of tempted to change the way ^Z is
handled in the interactive case since Sébastien already coded it up,
and just add a note in the README that PBS qsig works differently.
But then if somebody had a PBS patch to fix qsig, that would be great
to toss in too.
-- Pete