Is it possible to suspend/resume MPI jobs with mpiexec? (Through PBS or PBS Pro)

Pete Wyckoff pw at osc.edu
Thu Mar 18 09:01:25 EST 2004


francois at hpce.nec.com said on Thu, 18 Mar 2004 11:55 +0100:
> For production reasons, the goal is to suspend MPI jobs on a Myrinet cluster
> and restart them later.
> 
> Myricom (see below) pointed out that it could be done through the mpiexec
> interface.
> 
> The questions I have are (I assume that PBS or PBS Pro sends a SIGSTOP
> signal to mpirun/mpiexec):
> 
> * SIGSTOP cannot be caught, so how will mpiexec behave?
> 
> * Assuming the above issue is fixed, how will mpiexec manage to
>   freeze (suspend) all MPI processes, making sure that all MPI traffic
>   has completed successfully?

Mpiexec will not do anything special for you here.  Presumably you can
get PBS to send SIGSTOP to all tasks associated with the job.  All of
the tasks started by mpiexec are monitored by PBS.  Have you tried to
see if it works yet?
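
For anyone who wants to try the signal side of this by hand, here is a
minimal sketch in Python. It is not anything mpiexec or PBS does: the
helper names (signal_all, sigstop_is_catchable, demo) are made up for
illustration, and a sleeping child process stands in for an MPI task.
It checks that a handler for SIGSTOP cannot be installed, then stops
and continues the child with SIGSTOP and SIGCONT. It assumes a
POSIX system such as Linux and says nothing about Myrinet traffic
that is in flight.

```python
import os
import signal
import subprocess
import sys


def signal_all(pids, sig):
    """Send sig to every PID in pids (hypothetical helper, not an mpiexec or PBS API)."""
    for pid in pids:
        os.kill(pid, sig)


def sigstop_is_catchable():
    """Return False if a handler for SIGSTOP cannot be installed."""
    try:
        signal.signal(signal.SIGSTOP, lambda signum, frame: None)
    except (OSError, ValueError):
        # OSError: the kernel rejects handlers for SIGSTOP.
        # ValueError: Python only allows signal.signal() in the main thread.
        return False
    return True


def demo():
    """Stop and continue one child process; return (stopped, resumed)."""
    # Stand-in for a single MPI task: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        signal_all([child.pid], signal.SIGSTOP)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        signal_all([child.pid], signal.SIGCONT)
        # WCONTINUED makes waitpid report that the child was resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        resumed = os.WIFCONTINUED(status)
    finally:
        child.kill()
        child.wait()
    return stopped, resumed


if __name__ == "__main__":
    print("SIGSTOP catchable:", sigstop_is_catchable())
    print("stopped, resumed:", demo())
```

On a real cluster the list of PIDs would have to come from whatever
tracks the job's tasks on each node, which this sketch does not
attempt.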

The big problem is figuring out how to quiesce the Myrinet NICs in case
you want to reuse the same ports for non-suspended jobs on the same
nodes.

		-- Pete


