FW: Mpiexec intermittent problem
Pete Wyckoff
pw at osc.edu
Mon Aug 28 17:16:05 EDT 2006
briadam at sandia.gov wrote on Mon, 28 Aug 2006 14:30 -0600:
> Thanks for the suggestions. In this case I am the DAKOTA user (and one
> of the newest DAKOTA developers), so I will explore your ideas directly.
> A side note: overall the mpiexec tiling capability you introduced for us
> in v0.80 has been working wonderfully with DAKOTA and single or
> multiprocessor tiled analysis jobs.
Lucky you to debug all this. :) Hope we see something more
exciting in your verbose log files next time around.
> The jobs I'm having DAKOTA launch take from 5 to 500 seconds to execute
> (and for a particular DAKOTA run are nearly homogeneous in run time), so
> I don't know if the race condition you're describing is likely?
That sounds like more than enough time. The contests.pl script
generates random runtimes in 100ms..500ms to simulate "fast"
programs.
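For illustration only, a stand-in for one of those "fast" programs could look like the sketch below. This is not the code in contests.pl (which is Perl); the names `pick_runtime` and `run_fake_task` are hypothetical, and the only thing taken from the script is the 100ms..500ms range.

```python
import random
import time


def pick_runtime(rng, low_ms=100, high_ms=500):
    """Return a random runtime in seconds, between low_ms and high_ms milliseconds."""
    return rng.randint(low_ms, high_ms) / 1000.0


def run_fake_task(rng, exit_code=0):
    """Sleep for a short random time, then hand back the exit code the task would use.

    Hypothetical helper: it only mimics a short-lived job, nothing mpiexec-specific.
    """
    runtime = pick_runtime(rng)
    time.sleep(runtime)
    return runtime, exit_code
```

A wrapper script would call `run_fake_task(random.Random())` and pass the returned code to `sys.exit()`, so each simulated job finishes in well under a second.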
> I ran contests.pl (from your SVN repo) against the installed version of
> mpiexec (Version 0.80+20050801 , configure options:
> '--prefix=/apps/mpiexec-cvs' '--with-pbs=/apps/torque'
> '--with-default-comm=ib') and against v0.81, configured the same. The
> results were a little different and in the 0.80 case, the perl script
> fails to exit. I'm attaching a tarball with the output of the tests
> with and without "-v -v" -- maybe you'll see something suspicious that
> indicates I should run 0.81 to avoid this problem...
There were definitely some bugs fixed between 0.80 and 0.81 that
explain the two oddities in your 0.80 log files:
    Processes that exit(1) should propagate that value to the
    environment correctly.

    Killing the server results in a non-zero exit value when
    there are client tasks still running.
But neither of these should affect your problem. It never hurts to run
the more recent version, if you have that option; there were lots of
architectural changes between 0.80 and 0.81 that improved how
concurrent tasks are handled internally. There are a few more changes in
SVN head, but I don't spot anything that would fix your issues.
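If you want to check exit value propagation yourself, a generic sketch is below. It uses a plain child Python process rather than mpiexec, and `child_exit_status` is a hypothetical helper name; substituting your own launcher command line for the child command is an assumption about how you would adapt it, not something taken from the log files.

```python
import subprocess
import sys


def child_exit_status(code):
    """Run a child process that exits with `code` and return the status the parent sees."""
    # The child here is just the Python interpreter; any command that
    # exits with a known value would serve the same purpose.
    result = subprocess.run([sys.executable, "-c", f"import sys; sys.exit({code})"])
    return result.returncode
```

If the launcher propagates correctly, the value seen by the parent matches the value the child passed to exit().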
> Tbird uses torque-2.0.0p8 by default.
I'm hesitant to tell you to update because I don't know of any
particular problems. My testing is on torque-2.1.1 and I can't get
anything to fail, but my machine is teensy compared to yours.
-- Pete