SIGTSTP propagation for 0.80 ?
David Golden
dgolden at cp.dias.ie
Thu Sep 22 11:34:34 EDT 2005
On 2005-09-22 10:08:22 -0400, Pete Wyckoff wrote:
> dgolden at cp.dias.ie wrote on Thu, 22 Sep 2005 14:03 +0100:
> > Indeed. I just posted to torqueusers [1] about signal propagation behaviours
> > as queue/job attributes in a bit more depth, such a feature might at
> > least solve some such problems, barring that whole horrible
> > processes-wandering-out-of-pgrps thing that you probably need
> > selinux and/or sgi csa/pagg stuff to deal with on linux.
> >
> > [1] http://www.supercluster.org/pipermail/torqueusers/2005-September/002176.html
>
> Worth thinking about for sure. The only thing I'm worried about is
> if the job launcher (e.g. mpiexec) needs to know about the setting in
> the job script to do its kill/suspend behavior correctly.
> If there's no way for mpiexec to know which behavior is expected,
> how can it avoid doing the wrong thing?
I suppose it can't reliably :-(. On the other hand, it is a bit of
a "doctor, it hurts when I... / well, don't" situation: if
torque is deliberately set to depart from PBS-classical behaviour, and
mpiexec isn't set to correspond, then things will go wrong, predictably.
The defaults of both torque and mpiexec could presumably remain
the classic behaviour.
But all the same, you might well get a noticeably larger number of
problem reports. :-((
(If torque did have the ability to fiddle with signal propagation,
maybe mpiexec _could_ actually query torque for the signal propagation
settings, probably once at start time, and warn if something was
obviously amiss, though it would bring extra complexity to mpiexec).
> I guess I'm still tempted to do what PBS tradition has always done
> and work with you to adapt your original patch that directly
> suspends all the tasks under the control of mpiexec.
Yep, it's a very nice feature for mpiexec regardless of whether
torque decides to do anything newfangled to solve wider
problems, and it solves the immediate need for working suspension
of mpich jobs.
So: back to work.
Patch 0.80.tstp1 has the problem that interactive jobs don't quite work,
probably because I didn't really consider the requirements of a
job-controlled shell, just the TSTP-propagating-STOP behaviour required
for suspend. That needs fixing before it's viable for inclusion.
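
For readers unfamiliar with the idea, here is a minimal Python sketch of a launcher that turns a SIGTSTP it receives into SIGSTOP for the tasks it started, and resumes them on SIGCONT. It is not the 0.80.tstp1 patch (mpiexec is written in C and the patch is not shown in this thread); the function names launch_tasks and install_suspend_handlers and the sleeping placeholder tasks are assumptions made up for illustration. It also makes no attempt to handle the interactive, job-controlled-shell case mentioned above.

```python
import os
import signal
import subprocess
import sys


def launch_tasks(count):
    """Start `count` placeholder tasks (hypothetical stand-ins for MPI processes)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def install_suspend_handlers(tasks):
    """On SIGTSTP send SIGSTOP to every task; on SIGCONT send SIGCONT."""

    def forward(sig_to_send):
        def handler(signum, frame):
            for task in tasks:
                try:
                    os.kill(task.pid, sig_to_send)
                except ProcessLookupError:
                    pass  # task has already exited and been reaped
        return handler

    # SIGSTOP cannot be caught or ignored by the tasks, unlike SIGTSTP.
    signal.signal(signal.SIGTSTP, forward(signal.SIGSTOP))
    signal.signal(signal.SIGCONT, forward(signal.SIGCONT))
```

In this sketch the launcher itself keeps running after SIGTSTP, since it catches the signal; only the tasks are stopped. Whether the launcher should also stop itself is a design choice this example does not settle.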
(My earlier problems with qsig -s suspend not actually working, although
qsig -s TSTP did, seem to be system-specific and are not repeatable on
another (x86) cluster. They are not relevant to the mpiexec side: they occur
just the same with a bash script with traps.)
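
The test above used a bash script with traps, which is not reproduced in this message. As a rough equivalent, a small Python probe along the following lines could be run as the job to record which signals actually arrive. The names received, record and install_probe are hypothetical, and the default set of signals watched is an assumption.

```python
import os
import signal

# Names of the signals seen so far, in arrival order.
received = []


def record(signum, frame):
    """Note the signal by name instead of taking its default action."""
    received.append(signal.Signals(signum).name)


def install_probe(signals=(signal.SIGTSTP, signal.SIGCONT, signal.SIGTERM)):
    """Install `record` as the handler for each signal of interest."""
    for sig in signals:
        signal.signal(sig, record)
```

SIGSTOP and SIGKILL cannot be trapped, so a probe like this only shows signals that were delivered in a catchable form.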