SIGTSTP propagation for 0.80 ?

Roy Dragseth Roy.Dragseth at cc.uit.no
Tue Sep 20 02:48:49 EDT 2005


On Tuesday 20 September 2005 03:08, Chris Samuel wrote:
> On Tue, 20 Sep 2005 02:45 am, David Golden wrote:
> > +++ Actually, I must confess my opinion is now veering towards that a
> > TSTP maybe shouldn't be involved in "qsig -s suspend" at all, if it's
> > possible to make torque  SIGSTOP all tasks in the job, not just
> > the mother superior's.
>
> I don't know if this will be possible, as Torque has to deal with cases
> (such as MPICH and some commercial MPI apps, not to mention non-MPI
> parallel ones) where something like ssh/rsh is used to start the
> application across the nodes and then Torque often does not get the
> information about the processes it needs to keep track of on the other
> (non-MS) compute nodes.
>
> I suspect this is why Torque has effectively delegated that responsibility
> to the PBS script that has started the parallel job.

I agree. I don't think it is feasible to change the way torque handles
signals, as that would be quite a deep change in how torque behaves.
Remember, we have a lot of history to take into account, and there may be a
lot of scripts and apps that have been adapted to how it currently works.

To my knowledge, the following setups work with suspend/resume:

serial apps.
mpich with mpd (and possibly all other daemon-based MPI implementations that
propagate STOP on TSTP)
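
To illustrate what "propagates STOP on TSTP" means, here is a minimal sketch
in Python. It is not how mpd, mpiexec or torque actually implement it; the
function names (signal_tasks, install_propagation, launch) are made up for
this example. It assumes every task is a direct local child of the launcher,
which is exactly the assumption that breaks when tasks are started on other
nodes through ssh/rsh.

```python
import os
import signal
import subprocess


def signal_tasks(tasks, signum):
    """Send signum to every task that has not exited yet."""
    for task in tasks:
        if task.poll() is None:  # None means running or stopped, not exited
            os.kill(task.pid, signum)


def install_propagation(tasks):
    """On SIGTSTP, SIGSTOP all tasks and then ourselves; on SIGCONT, resume them."""

    def on_tstp(signum, frame):
        signal_tasks(tasks, signal.SIGSTOP)
        # SIGSTOP cannot be caught, so this really suspends the launcher.
        os.kill(os.getpid(), signal.SIGSTOP)

    def on_cont(signum, frame):
        signal_tasks(tasks, signal.SIGCONT)

    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)


def launch(commands):
    """Hypothetical launcher: start local tasks, propagate suspend/resume, wait."""
    tasks = []
    install_propagation(tasks)  # handlers see tasks as they are appended
    for cmd in commands:
        tasks.append(subprocess.Popen(cmd))
    return [task.wait() for task in tasks]
```

With something like launch([["./a.out"], ["./a.out"]]) running as the job's
top process, a TSTP delivered to the launcher stops both tasks and a later
CONT resumes them. A task started on another node via ssh is not in the
task list, so it would keep running.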

These don't work:
mpich/p4 with mpirun (this won't work until signal propagation is fixed in
ssh; maybe rsh works?)
mpich/p4 with mpiexec (0.78 with patch works, 0.80 doesn't?)

r.
-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no

