FW: Mpiexec intermittent problem

Adams, Brian M briadam at sandia.gov
Mon Aug 28 13:07:09 EDT 2006


I'm forwarding a help request and follow-up initially sent to Sandia's
thunderbird help in case you have any additional insights.  (I realize
the directory paths are only of use for someone with an account on that
machine, so please advise if I should send any files or output.)

Brian
-------------------------------------------
Brian M. Adams, Ph.D. (briadam at sandia.gov)
Optimization and Uncertainty Estimation
Sandia National Laboratories
P.O. Box 5800, Mail Stop 1318
Albuquerque, NM 87185-1318
Voice: 505-284-8845, FAX: 505-284-2518
 
----- MESSAGE 1 -----
I am trying to diagnose an intermittent mpiexec problem on thunderbird
(also observed on liberty) and am hoping you might have some ideas.  I
am using DAKOTA in conjunction with the new mpiexec tiling feature to
farm analysis jobs out to compute nodes.  Here are some details:

My qsub script includes the following (see, e.g.,
/scratch3/briadam/msopt/quad_8_100/sub1/qsub1.in):
-----
# start the mpiexec in server mode to allow for job tiling
/projects/sysapps/mpiexec/bin/mpiexec -server -verbose &
# launch DAKOTA
/projects/dakota/TBIRD/bin/dakota -i dakota_opt.in 2>&1 1>dakota.out
-----
During execution, DAKOTA repeatedly calls a shell script (see, e.g.,
/scratch3/briadam/msopt/quad_8_100/sub1/tapered_ss.sh) to perform finite
element analysis with Aria.  (In fact, DAKOTA will run up to NUM_NODES*PPN
instances of this script simultaneously to launch tiled analysis jobs on
all the processors in the PBS allocation.)  The key parts of the script are:
-----
MPI_BIN=/projects/sysapps/mpiexec/bin/mpiexec
BIN_PATH=/home/briadam/bin/aria_2006_08_23

num=<simulationcallnumber>
mkdir workdir.$num
cd workdir.$num

# Preprocessing to create necessary input files 
...
# Call custom SIERRA/Aria binary for analysis -- this gets launched on
# an available compute node within the processor allocation:
$MPI_BIN -n 1 $BIN_PATH/aria_64ws32bit-gnu3.4.3ip_dp_opt.x -d `pwd`/ -p \
  $BIN_PATH/sierra.xmldb -i fcbm_input.i -o fcbm_input.log

# Postprocessing of output
-----

This typically works fine for hundreds or even thousands of calls to aria,
but then one call hangs at random.  When a call hangs, I find that aria has
run to completion (at least according to fcbm_input.log), but mpiexec
never exits.  An example of a working directory in this state can be
seen at /scratch3/briadam/msopt/quad_8_100/sub1/workdir.746

If I manually kill the relevant mpiexec process, the postprocessing
completes as expected and DAKOTA continues merrily along.
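
As a stopgap only, the manual kill described above could be automated
with a small watchdog around the mpiexec call.  The following is a
minimal sketch, not a fix for the underlying problem: run_with_watchdog
is a hypothetical helper name, and the time limit is an assumption that
would have to be chosen well above the longest expected aria run.

```shell
# Hypothetical helper: run a command in the background and send it
# SIGTERM if it is still running after LIMIT seconds.
# Usage: run_with_watchdog LIMIT command [args...]
# Returns the command's exit status (above 128 if it was killed).
run_with_watchdog() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    elapsed=0
    # Poll once per second until the command exits or the limit is hit.
    while kill -0 "$cmd_pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            kill "$cmd_pid" 2>/dev/null
            break
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Collect the exit status of the command.
    wait "$cmd_pid"
}
```

In the analysis script, the mpiexec line would then be prefixed with
something like "run_with_watchdog 3600" (the 3600 seconds being purely
illustrative), so that postprocessing proceeds even if mpiexec itself
never returns.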

Any ideas on how I can chase this down?

----- MESSAGE 2 -----
I neglected to mention that when mpiexec for the aria analysis does
successfully exit, I get the following warning:

  mpiexec: Warning: task 0 exited before completing MPI startup.

(So the warning appears on many successful calls before one hangs.)  See
the following files for context:

  /scratch3/briadam/msopt/quad_8_100/sub1/qsub1.in.o105584
  /scratch3/briadam/msopt/quad_8_100/sub1/qsub1.in.e105584
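
To check whether the warning appears on every call or only on some, the
number of warnings in such an output file can be compared against the
number of completed analyses.  A minimal sketch, where
count_startup_warnings is a hypothetical helper name and the file
argument is whichever PBS output file holds the mpiexec messages:

```shell
# Hypothetical helper: print the number of lines in FILE that contain
# the mpiexec startup warning.
# Usage: count_startup_warnings FILE
count_startup_warnings() {
    grep -c 'exited before completing MPI startup' "$1"
}
```

Note that grep -c prints 0 but returns a nonzero exit status when the
file contains no matching lines.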



