FW: Mpiexec intermittent problem

Pete Wyckoff pw at osc.edu
Mon Aug 28 17:16:05 EDT 2006


briadam at sandia.gov wrote on Mon, 28 Aug 2006 14:30 -0600:
> Thanks for the suggestions.  In this case I am the DAKOTA user (and one
> of the newest DAKOTA developers), so I will explore your ideas directly.
> A side note: overall the mpiexec tiling capability you introduced for us
> in v0.80 has been working wonderfully with DAKOTA and single or
> multiprocessor tiled analysis jobs.

Lucky you to debug all this.  :)  Hope we see something more
exciting in your verbose log files next time around.

> The jobs I'm having DAKOTA launch take from 5 to 500 seconds to execute
> (and for a particular DAKOTA run are nearly homogeneous in run time), so
> I don't know if the race condition you're describing is likely?

That sounds like more than enough time, so the race is unlikely.  For
comparison, the contests.pl script generates random runtimes of
100ms..500ms to simulate "fast" programs.
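
To give a feel for what "fast" means here, a minimal Python sketch of
the idea follows.  It is not the actual contests.pl (which is Perl);
the names pick_runtime_ms and fake_task are made up for illustration,
and a real test would launch the tasks through mpiexec rather than
sleeping in-process.

```python
import random
import time


def pick_runtime_ms(rng, low_ms=100, high_ms=500):
    """Return a random runtime in milliseconds, both bounds inclusive."""
    return rng.randint(low_ms, high_ms)


def fake_task(runtime_ms, exit_code=0):
    """Sleep for runtime_ms milliseconds, then return the exit code
    that a short-lived task would report (hypothetical stand-in)."""
    time.sleep(runtime_ms / 1000.0)
    return exit_code


if __name__ == "__main__":
    rng = random.Random()
    for i in range(3):
        ms = pick_runtime_ms(rng)
        code = fake_task(ms)
        print(f"task {i}: ran {ms} ms, exit code {code}")
```

Tasks of 5 to 500 seconds are at least an order of magnitude longer
than anything this range produces.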

> I ran contests.pl (from your SVN repo) against the installed version of
> mpiexec (Version 0.80+20050801 , configure options:
> '--prefix=/apps/mpiexec-cvs' '--with-pbs=/apps/torque'
> '--with-default-comm=ib') and against v0.81, configured the same.  The
> results were a little different and in the 0.80 case, the perl script
> fails to exit.  I'm attaching a tarball with the output of the tests
> with and without "-v -v" -- maybe you'll see something suspicious that
> indicates I should run 0.81 to avoid this problem...

There were definitely some bugs fixed between 0.80 and 0.81 that
explain the two oddities in your 0.80 log files:

    Processes that exit(1) should propagate that value to the
    environment correctly.

    Killing the server results in a non-zero exit value when
    there are client tasks still running.
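
As an illustration of what the first item means by propagating an exit
value, here is a small Python sketch.  It is an assumption-laden
stand-in, not how mpiexec does it: run_and_propagate is a hypothetical
helper that starts one child process and hands back its exit status
unchanged.

```python
import subprocess
import sys


def run_and_propagate(argv):
    """Run a child process and return its exit status unchanged.

    On POSIX, subprocess reports a negative value if the child was
    killed by a signal; that case is passed through as-is here.
    """
    result = subprocess.run(argv)
    return result.returncode


if __name__ == "__main__":
    # A child that does the equivalent of exit(1).
    child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    status = run_and_propagate(child)
    # A launcher would normally finish by exiting with this status
    # itself, so the calling environment sees the child's value.
    print(f"child exited with {status}")
```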

But neither of these should affect your problem.  It never hurts to run
the more recent version, if you have that option; there were lots of
architectural changes between 0.80 and 0.81 that improved how
concurrent tasks are handled internally.  There are a few more changes
in SVN head, but I don't spot anything that would fix your issues.

> Tbird uses torque-2.0.0p8 by default.

I'm hesitant to tell you to update because I don't know of any
particular problems with that version.  My testing is on torque-2.1.1
and I can't get anything to fail, but my machine is teensy compared to
yours.

		-- Pete

