FW: Mpiexec intermittent problem
Adams, Brian M
briadam at sandia.gov
Tue Aug 29 11:18:04 EDT 2006
> -----Original Message-----
> From: Pete Wyckoff [mailto:pw at osc.edu]
> Sent: Monday, August 28, 2006 3:16 PM
>
> There were definitely some bugs fixed between 0.80 and 0.81
> that explain the two oddities in your 0.80 log files:
>
> Processes that exit(1) should propagate that value to the
> environment correctly.
>
> Killing the server results in a non-zero exit value when
> there are client tasks still running.
>
> But neither of these should affect your problem. Never hurts
> to run the more recent, if you have that option; there were
> lots of architectural changes between 0.80 and 0.81 that
> improved internally how concurrent tasks are handled. There
> are a few more changes to SVN head, but I don't spot anything
> that would fix your issues.
I'll next try rerunning my case with 0.81, just to see if there's any
difference. The hang pops up every few hours on this run I'm testing
with, so it's not too painful to reproduce. I want to make sure I built
mpiexec the same way as the sysadmins, and then I can test.
--comm=none fixed the warning. The aria binary we're using is
MPI-enabled for parallel analysis, but we don't need more than one CPU
per analysis -- they somehow built it to also run in serial outside an
MPI environment, which must be what we're seeing. I'm only mpiexec-ing
it to take advantage of the concurrent parallelism/server mode.
I'm attaching the verbose ("-v -v") output from both the mpiexec server and
the clients for an actual failed simulation run. When DAKOTA makes the system
call to the simulator script containing the client mpiexec calls, redirecting
to a file captures neither stdout nor stderr from mpiexec itself, so the
client output is all interspersed in the dakota.out files.
This run went through the first 765 model evaluations, then scheduled
766--770, and 769 hung. At this point I created mpiexec_server.out.769
and dakota.out.769. I went to the zero node of the PBS allocation and
created headnode_ps.769, showing the still running mpiexec server and
the defunct mpiexec client. Killing process 20385 caused the rest of
the 769th simulator script to run and DAKOTA to keep going until
evaluation 1495, when I did a qdel and created the *.qdel versions of
the output files.
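For anyone wanting to repeat the head-node check, here is a minimal sketch of
listing defunct processes together with their parents. It assumes a Linux
procps-style ps that accepts "-eo pid,ppid,stat,comm"; the function names
find_defunct and list_defunct are made up for illustration and are not part of
mpiexec or DAKOTA. A defunct (zombie) entry has a state starting with "Z", and
its PPID is the process that has not yet waited on it.

```python
import subprocess


def find_defunct(ps_text):
    """Return (pid, ppid, command) for each zombie in ps output.

    Expects the text produced by `ps -eo pid,ppid,stat,comm`
    (assumed procps-style column layout with one header row).
    """
    zombies = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):  # 'Z' marks a defunct (zombie) process
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_defunct():
    """Run ps on the current node and report its defunct processes."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_defunct(result.stdout)
```

Calling list_defunct() on the zero node while an evaluation is hung would show
which parent still owns the defunct client.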
I suppose it's possible that the aria binary which I'm mpiexec-ing isn't
properly exiting and mpiexec isn't to blame. If you have any ideas for
testing that possibility, I welcome them.
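One possible way to test this, sketched under assumptions: wrap the binary in a
small script that logs when it starts, when it exits, and with what return
code, and hand that wrapper to mpiexec in place of the binary. If the wrapper
logs an exit but the mpiexec client still hangs, the binary is probably not
the culprit. The name run_logged and the timeout handling are hypothetical,
not anything mpiexec or aria provides.

```python
import subprocess
import sys
import time


def run_logged(cmd, timeout_s=None, log=sys.stderr):
    """Run cmd, log how it starts and ends, and return its exit code.

    Returns None if the command was still running after timeout_s seconds
    (in which case it is killed and reaped).
    """
    start = time.time()
    log.write(f"[wrapper] start {cmd!r} at {time.ctime(start)}\n")
    log.flush()
    proc = subprocess.Popen(cmd)
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        log.write(f"[wrapper] pid {proc.pid} still running after {timeout_s}s\n")
        log.flush()
        proc.kill()
        proc.wait()  # reap the child so it does not linger as defunct
        return None
    elapsed = time.time() - start
    log.write(f"[wrapper] pid {proc.pid} exited rc={rc} after {elapsed:.1f}s\n")
    log.flush()
    return rc
```

A simulator script could call run_logged with the analysis command line as a
list of arguments and a generous timeout, then compare the wrapper's log
against the mpiexec server and client logs for the hung evaluation.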
Brian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tbird_mpiexec_logs.tgz
Type: application/x-compressed
Size: 398837 bytes
Desc: tbird_mpiexec_logs.tgz
Url : http://email.osc.edu/pipermail/mpiexec/attachments/20060829/ca64744d/tbird_mpiexec_logs-0001.bin