SIGTSTP propagation for 0.80 ?
Roy Dragseth
Roy.Dragseth at cc.uit.no
Tue Sep 20 02:48:49 EDT 2005
On Tuesday 20 September 2005 03:08, Chris Samuel wrote:
> On Tue, 20 Sep 2005 02:45 am, David Golden wrote:
> > +++ Actually, I must confess my opinion is now veering towards that a
> > TSTP maybe shouldn't be involved in "qsig -s suspend" at all, if it's
> > possible to make torque SIGSTOP all tasks in the job, not just
> > the mother superior's.
>
> I don't know if this will be possible, as Torque has to deal with cases
> (such as MPICH and some commercial MPI apps, not to mention non-MPI
> parallel ones) where something like ssh/rsh is used to start the
> application across the nodes and then Torque often does not get the
> information about the processes it needs to keep track of on the other
> (non-MS) compute nodes.
>
> I suspect this is why Torque has effectively delegated that responsibility
> to the PBS script that has started the parallel job.
I agree. I don't think it is feasible to change the way Torque handles
signals, as that would be quite a deep change in how Torque behaves.
Remember, we have a lot of history to take into account, and there might be a
lot of scripts and apps that are adjusted to how it currently works.
To my knowledge the following setups work with suspend/resume:
serial apps.
mpich with mpd (and possibly all other daemon-based MPI implementations that
propagate STOP on TSTP)
These don't work:
mpich/p4 with mpirun (this won't work until signal propagation is fixed in
ssh; maybe rsh works?)
mpich/p4 with mpiexec (0.78 with patch works, 0.80 doesn't?)
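A minimal sketch of what "propagates STOP on TSTP" means for a daemon-style
launcher, assuming the launcher started the tasks itself and therefore knows
their pids. It is written in Python purely for illustration; it is not the mpd
or mpiexec code, and the names translate_signal and TaskLauncher are made up
for this example. SIGTSTP can be caught while SIGSTOP cannot, so the launcher
catches TSTP, sends STOP to each task, and keeps running itself so that it can
later forward CONT.

```python
import os
import signal
import subprocess


def translate_signal(signum):
    """Return the signal to send to the tasks for a signal the launcher got."""
    # TSTP is catchable, STOP is not, so TSTP is turned into STOP.
    if signum == signal.SIGTSTP:
        return signal.SIGSTOP
    return signum


class TaskLauncher:
    """Hypothetical launcher that forwards suspend/resume to local tasks."""

    def __init__(self, commands):
        # Start one child process per command (each command is an argv list).
        self.tasks = [subprocess.Popen(argv) for argv in commands]

    def forward(self, signum, frame=None):
        # Send the translated signal to every task that has not exited.
        out = translate_signal(signum)
        for task in self.tasks:
            if task.poll() is None:
                os.kill(task.pid, out)

    def install_handlers(self):
        # Catch TSTP (suspend) and CONT (resume) in the launcher itself.
        signal.signal(signal.SIGTSTP, self.forward)
        signal.signal(signal.SIGCONT, self.forward)
```

This only reaches processes on the launcher's own node. Tasks started on other
nodes through ssh/rsh would each need such a launcher (or daemon) on their own
node, which is the case Chris describes above.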
r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no