strange error (when hitting walltime limit)

Pete Wyckoff pw at osc.edu
Wed Jul 2 15:09:07 EDT 2003


beaneg at umcs.maine.edu said on Wed, 02 Jul 2003 11:08 -0400:
> When I use mpirun, if a job hits its walltime limit I get a message
> stating that.  With mpiexec if a job hits its walltime limit, I get a
> mpiexec: warning: main: task x died with signal 15 in the stderr file. 
> I know the job hit it's walltime, and prior to that it was running
> properly.
> 
> I'm using the latest version of PBS Pro.  Is this normal behavior of
> mpiexec?

Perfectly normal, and in fact desired.

Using mpirun, the PBS mom will kill all processes that it can find on
the mother superior node.  Eventually the MPI processes on other nodes
will die off because they notice that one of their brethren has gone
away.  PBS does not know about these processes on other nodes since they
were started via rsh, and so cannot know to kill them off.

With mpiexec, PBS itself starts all the processes in the parallel job.
Thus when it notices that you have gone beyond your walltime, it can
kill off each process individually, with no mess and no fuss.  The
warning you see is mpiexec reporting that a task was terminated by that
kill (signal 15 is SIGTERM).  This ensures that you don't get runaway
processes due to code bugs, for one thing, and also accounts for CPU and
other resources used by the entire job, not just process number zero.
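
To illustrate where a "died with signal 15" message comes from, here is
a minimal sketch in Python.  It is not mpiexec's actual code; the names
describe_exit and run_and_terminate are hypothetical, and a sleeping
child process stands in for an MPI task.  The parent sends SIGTERM, as
a batch system might when the walltime limit is reached, and then
inspects how the child exited.  It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(task, returncode):
    """Return a warning string if the task was killed by a signal, else None."""
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode < 0:
        return f"warning: task {task} died with signal {-returncode}"
    return None


def run_and_terminate():
    # Hypothetical stand-in for one long-running task of a parallel job.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Stand-in for the batch system enforcing the walltime limit.
    child.send_signal(signal.SIGTERM)
    # wait() returns the exit status; negative means killed by a signal.
    return describe_exit(0, child.wait())


if __name__ == "__main__":
    print(run_and_terminate())
```

On Linux, where SIGTERM is signal number 15, this should print a
warning of the same shape as the one in the stderr file.  The point is
only that the launcher sees the signal because it is the parent of
every task, which is not the case for processes started via rsh.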

Any recommendations for changes to that warning?  Is any of this
surprising enough to require documentation somewhere?

		-- Pete


