SIGTSTP propagation for 0.80?
Roy Dragseth
Roy.Dragseth at cc.uit.no
Mon Sep 19 16:29:57 EDT 2005
On Monday 19 September 2005 20:57, Roy Dragseth wrote:
> On Monday 19 September 2005 14:17, Chris Samuel wrote:
> > > The methodology is very simple (maybe too simple), send the SIGTSTP to
> > > all tasks, wait one second, send SIGSTOP to all tasks.
> >
> > Is that on all the mom's involved with the job, or just the mother
> > superior ?
>
> It seems to be only on the MS, this is odd.
Well, not that odd actually...
>
> I really thought this was done on all the moms in the job, but decided to
> check first before I answered. It seems like suspend/resume signals are
> handled differently than others, SIGKILL etc is done on all nodes in the
> job. I'll have to do some code spelunking tonight, I do believe that the
> right behaviour is to do the signalling on all nodes. Right?
>
After some testing and code inspection, it seems that only the MS receives
the signal, and this goes for all types of signals. The sisters only receive
kill orders from the MS once all tasks on the MS have terminated, e.g. on job
exit. I'll take this to the torque list and ask there.
As a related issue, I've seen some comments that qsig -s TSTP does the right
thing, but I cannot reproduce this myself. On my test cluster, no mpiexec
jobs are stopped by this signal. This is with torque 1.2.0p5 and mpiexec
0.80. Is anyone else seeing this behaviour?
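
For reference, the two-step suspend quoted at the top (SIGTSTP to all tasks,
wait one second, SIGSTOP to all tasks) can be sketched roughly as below. This
is only an illustration in Python for processes on the local node, not the
actual mpiexec or torque code. The names suspend_tasks and resume_tasks are
made up for the example, and it does nothing about forwarding the signal to
the sister moms, which is the open question above.

```python
import os
import signal
import time


def suspend_tasks(pids, grace_seconds=1.0):
    """Hypothetical helper: polite SIGTSTP first, then an unconditional SIGSTOP.

    Returns the list of pids that were still present when SIGSTOP was sent.
    """
    signalled = []
    for pid in pids:
        try:
            # SIGTSTP can be caught, so a task gets a chance to tidy up first.
            os.kill(pid, signal.SIGTSTP)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # task has already exited

    # Give the tasks a moment to react before forcing the stop.
    time.sleep(grace_seconds)

    stopped = []
    for pid in signalled:
        try:
            # SIGSTOP cannot be caught or ignored.
            os.kill(pid, signal.SIGSTOP)
            stopped.append(pid)
        except ProcessLookupError:
            pass  # task exited during the grace period
    return stopped


def resume_tasks(pids):
    """Hypothetical helper: let previously stopped tasks continue."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGCONT)
        except ProcessLookupError:
            pass  # task is gone, nothing to resume
```

Calling suspend_tasks with the pids of the local tasks and later resume_tasks
with the returned list would mirror a suspend/resume cycle on a single node.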
r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no