[torqueusers] BUG: MOM segfaults
Garrick Staples
garrick at usc.edu
Wed Feb 2 17:26:08 EST 2005
After reading everything below and looking through the code some more, I still
don't think that call to set_globid() is needed. Maybe it was needed with
OpenPBS 2.3.12, but not with recent torques.
In addition, I'm realizing that mpiexec still doesn't work after restarting a
mom. I think the main reason is that the ji_stdout and ji_stderr port numbers
aren't saved with the job, so the restarted mom can't contact the original
pbs_demux when a new TM_SPAWN request comes in.
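To make the port-number point concrete, here is a rough sketch in C of what
"saving the ports with the job" could look like. Everything in it is an
assumption for illustration: the demux_ports struct, the side file, and the
save_demux_ports()/load_demux_ports() helpers are hypothetical names, not
TORQUE code and not the real job save format. Only the two field names come
from the discussion above.

```c
#include <stdio.h>

/* Hypothetical record holding only the two port fields named above.
 * This is NOT the real TORQUE job structure or on-disk layout. */
struct demux_ports {
    int ji_stdout;  /* stdout port number associated with the job */
    int ji_stderr;  /* stderr port number associated with the job */
};

/* Write the ports to a side file (assumed path chosen by the caller).
 * Returns 0 on success, -1 on error. */
int save_demux_ports(const char *path, const struct demux_ports *p)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%d %d\n", p->ji_stdout, p->ji_stderr) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the ports back, as a restarted daemon might do during recovery.
 * Returns 0 on success, -1 if the file is missing or malformed. */
int load_demux_ports(const char *path, struct demux_ports *p)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int n = fscanf(fp, "%d %d", &p->ji_stdout, &p->ji_stderr);
    fclose(fp);
    return n == 2 ? 0 : -1;
}
```

The idea would be to call something like save_demux_ports() when the job is
written out and load_demux_ports() when a restarted mom recovers it, so the
ports are known again before a TM_SPAWN request arrives. Whether that alone is
enough for mpiexec to work across a restart is not something I have verified.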
I'm still looking into this stuff, so I may be changing my mind as I sort
everything out.
On Wed, Feb 02, 2005 at 11:35:55AM +1100, Chris Samuel alleged:
> /* CC'd to the mpiexec mailing list for Pete to comment on */
>
> On Wed, 2 Feb 2005 10:44 am, Garrick Staples wrote:
>
> > On Tue, Feb 01, 2005 at 10:50:22AM -0700, Marc Aurele La France alleged:
> > > Hi.
> > >
> > > init_abort_job() in src/resmom/catch_child.c contains a call to
> > > set_globid(pj,NULL). Consequently, it behooves all set_globid()
> >
> > I looked through the code a bit and began to doubt whether that
> > set_globid() call in init_abort_job() was actually required. The comment
> > says it came from the mpiexec patch. I commented it out and was able to
> > run new jobs with mpiexec just fine.
> >
> > Anyone from the mpiexec crowd know about this and can comment?
>
> This has come from the mpiexec patch against OpenPBS 2.3.12 to stop a
> restarting mom from killing a job launched by mpiexec:
>
> mpiexec-0.77/patch/pbs-2.3.12-mom-restart.diff
>
> The relevant fragment in that patch says:
>
>     /* set the globid so mom does not coredump in response
>      * to tm_spawn */
>     set_globid(pj, 0);
>
> This patch and a description is listed in Pete's collection of OpenPBS patches
> at:
>
> http://www.osc.edu/~pw/pbs/
>
> It says:
>
> mom-restart.patch - Track running jobs properly across a mom restart.
>
> For mpiexec-spawned jobs to survive across a mom restart, and to enable proper
> accounting for all jobs which continue across a mom restart, this patch fixes
> some behavior of mom when restarted with the "-p" flag. Note that this patch
> adds functionality to the machine-specific part of the mom code for linux
> only. Users of other system types could cut-n-paste that code without too
> much problem, but as it stands, this patch will break compilation on
> non-linux systems.
>
> This patch does four things:
>
> - Fix coredump resulting from tm_spawn to restarted pbs_mom
> - Avoid race condition by which pbs_mom would sometimes kill itself as tasks
> exit.
> - Make a restarted pbs_mom search for and report exiting tasks from jobs which
> were started before the old mom was killed.
> - Change response of pbs_mom to various signals. Now the default is to leave
> all jobs running. If you want to stop all jobs, USR1 can be used to achieve
> the old behavior.
>
>
> cheers!
> Chris
> --
> Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
> Victorian Partnership for Advanced Computing http://www.vpac.org/
> Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California