suspend jobs

Pete Wyckoff pw at osc.edu
Thu Apr 22 13:17:52 EDT 2004


Sebastien.Georget at sophia.inria.fr wrote on Thu, 22 Apr 2004 16:55 +0200:
>   I would like to know if there are plans to add suspend support to 
> mpiexec. I wrote a small patch (see attachment) which seems to do the 
> job. It is mainly cut&paste code but it seems to show that it is 
> possible (with simple mpi progs at least). Does anybody else work on 
> this subject ?

I do think it's a worthy idea, but I've been hesitant to change
anything due to some confusion on my part about how this should work
in all cases.  I'd like to get it right the first time if possible.

Interactive

    When the user hits ^Z to suspend a running mpiexec, the current
    behavior is that the running parallel job stays running.  Do you
    think the right thing is to send SIGSTOP to the processes and then
    suspend mpiexec itself?  (Then on continue, wake them up too, of
    course.)
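
    A minimal sketch of that stop-then-suspend sequence, written in
    Python for brevity rather than as a patch to mpiexec.  It assumes
    the launcher keeps a list of local child PIDs (the hypothetical
    "children" list) and can reach them with kill(); tasks running on
    other nodes would need some other delivery step.

```python
import os
import signal

# Hypothetical: PIDs of the local processes the launcher started.
children = []

def forward(sig):
    """Send sig to every tracked child, skipping ones that already exited."""
    for pid in children:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass

def on_tstp(signum, frame):
    # ^Z arrives as SIGTSTP, which can be caught.  Stop the children
    # first, then stop this process with SIGSTOP, which cannot be caught.
    forward(signal.SIGSTOP)
    os.kill(os.getpid(), signal.SIGSTOP)

def on_cont(signum, frame):
    # Runs once the shell resumes us (fg or bg): wake the children too.
    forward(signal.SIGCONT)

def install_handlers():
    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)
```

    A launcher would call install_handlers() once at startup and add
    each PID to "children" as it spawns a process.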

Batch

    In batch use it's quite a different scenario.  Here I imagine that a
    scheduler or user might want to suspend a job or send it a HUP to
    reread configuration files or some such.  With PBS the command
    "qsig" sends an arbitrary signal to all tasks in the job session on
    the first node.  Sending a SIGSTOP will stop the shell running your
    script, the mpiexec, and the processes of the parallel job that
    happen to be running on node #0.  But it doesn't send the signal to
    any other nodes.

    This does not seem like the right behavior to me.  I think one
    generally wants to do something to all the processes.  That
    conclusion implies the same answer in the interactive case, for
    symmetry.

    Implementing this properly requires fixing PBS.  It would be a hack
    for mpiexec to catch the signal and send it to the processes running
    on other nodes.  Plus some people may use PBS without running
    mpiexec, believe it or not.
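
    One caveat for the catch-and-relay approach: SIGSTOP (like SIGKILL)
    cannot be caught, so a relay could only cover catchable signals
    such as SIGHUP or SIGTSTP.  A small Python sketch of that limit,
    where "relay" is a hypothetical callback standing in for whatever
    would deliver the signal to the other nodes:

```python
import signal

def install_relay(sig, relay):
    """Try to install a handler that passes sig on to relay(sig).

    Returns False when the kernel refuses the handler, as it does for
    SIGSTOP and SIGKILL; returns True otherwise.
    """
    try:
        signal.signal(sig, lambda signum, frame: relay(signum))
    except OSError:
        return False
    return True
```

    With this, install_relay(signal.SIGHUP, relay) succeeds, while
    install_relay(signal.SIGSTOP, relay) reports failure, so a qsig
    that sends SIGSTOP would still stop only the first node.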

Any opinions welcome.  I'm sort of tempted to change the way ^Z is
handled in the interactive case since Sébastien already coded it up,
and just add a note in the README that PBS qsig works differently.
But then if somebody had a PBS patch to fix qsig, that would be great
to toss in too.

		-- Pete
