Bug or strange results in runtests.pl

Pete Wyckoff pw at osc.edu
Wed Nov 15 15:16:32 EST 2006


annaj at hi.is wrote on Mon, 13 Nov 2006 15:41 +0000:
> The error appears during the following tests: 
> mpiexec --comm=pmi hello -sleep -abort 0
[..]
> [cli_0]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 4269) - process 0
> mpiexec: Warning: task 0 exited with status 173.
> mpiexec: Warning: task 1 exited oddly---report bug: status 0 done 0.
> mpiexec: Warning: task 2 exited with status -767359304.
> mpiexec: Warning: task 3 exited with status 59.
> =>> PBS: job killed: walltime 308 exceeded limit 300

Sorry for the late reply.  This "hello -abort" test is just part of
the testsuite to look for regressions in mpiexec as we develop new
features and fix bugs.  I wouldn't worry too much if something
doesn't seem to work right, but thanks for pointing it out.

This particular test has one task call MPI_Abort().  The MPI
specification does not say exactly what this is supposed to do, but
a "good" implementation would try to terminate all the other tasks
in the communicator.  Mpich1/p4 and others did that just fine, but
mpich2 does not.  I've asked the mpich2 developers whether they plan
to fix this at some point, but so far they have not done so.

mpiexec notices that one task died, but it doesn't kill the others,
in case you called exit() on purpose but expected the rest to keep
going.  It _would_ kill all the others if the task died from a
signal, like a SIGSEGV, but the abort implementation in mpich2 just
calls exit().  For a real application, you can add "-kill" to the
mpiexec command line to force it to terminate all the other tasks
when any one exits.
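
To make the exit-versus-signal distinction concrete, here is a rough
sketch in C of how a launcher could tell the two cases apart from a
waitpid() status.  This is not the actual mpiexec code: the names
classify_status, should_kill_others, run_child and the kill_on_exit
flag (standing in for "-kill") are made up for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How a task ended, as seen by the process that waited on it. */
enum task_end { TASK_EXITED, TASK_SIGNALED, TASK_OTHER };

/* Classify a wait status filled in by waitpid(). */
static enum task_end classify_status(int status)
{
    if (WIFEXITED(status))
        return TASK_EXITED;      /* task called exit() or returned from main */
    if (WIFSIGNALED(status))
        return TASK_SIGNALED;    /* task was killed by a signal, e.g. SIGSEGV */
    return TASK_OTHER;
}

/*
 * Hypothetical policy mirroring the description above: a signal death
 * always takes the other tasks down; a plain exit does so only when
 * kill_on_exit is set (standing in for "-kill").
 */
static int should_kill_others(int status, int kill_on_exit)
{
    switch (classify_status(status)) {
    case TASK_SIGNALED:
        return 1;
    case TASK_EXITED:
        return kill_on_exit != 0;
    default:
        return 0;
    }
}

/*
 * Fork a child that raises sig on itself if sig is nonzero, and
 * otherwise calls _exit(exit_code).  Stores the wait status in *status.
 * Returns 0 on success, -1 if fork() or waitpid() failed.
 */
static int run_child(int exit_code, int sig, int *status)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);   /* make sure the default action applies */
            raise(sig);
        }
        _exit(exit_code);
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}
```

With this sketch, run_child(4269, 0, &st) yields WEXITSTATUS(st) of
173, because only the low 8 bits of the exit code survive (4269 &
0xff).  That is consistent with the "status 173" reported for task 0
in the quoted output, and should_kill_others(st, 0) returns 0 for it,
which is why the remaining tasks are left running.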

In this test, then, the other three just hang around waiting for
something to happen until the job's walltime allocation runs out.
(The "oddly" messages and random status codes were a bug I fixed a
while back.)  I'll put a warning in the runtests.pl code pointing
out that this will happen on mpich2 until the mpich2 developers
decide to fix it.

		-- Pete

