[MPICH2 Req #3996] MPI_Abort not stopping code

Pavan Balaji balaji at mcs.anl.gov
Tue Mar 25 15:42:31 EDT 2008


Aha, I was wrong. It does not in fact hand the abort over to the process 
manager, though it should. A possible workaround is to pass it to the 
process manager where the PM supports it, and to exit blindly where it 
doesn't. I'll discuss this with Bill and see what can be done within 
PMI-1, before PMI-2 is ready.
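
As a rough sketch of that fallback (not MPICH2 code): the capability flag
pm_supports_abort, the function names, and the "cmd=abort exitcode=N" line
are all assumptions made for illustration, loosely modeled on the key=value
lines that simple PMI exchanges with the process manager.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not part of any PMI API. */
enum abort_action { ABORT_SENT_TO_PM, ABORT_LOCAL_EXIT };

/* Format an illustrative abort message. The "cmd=abort exitcode=N" layout
 * is an assumption of this sketch. Returns the message length, or -1 if
 * the message does not fit in buf. */
static int format_abort_msg(char *buf, size_t len, int exit_code)
{
    int n = snprintf(buf, len, "cmd=abort exitcode=%d\n", exit_code);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Tell the process manager about the abort over pm_fd when it supports
 * abort handling (pm_supports_abort is a hypothetical capability flag).
 * Otherwise, or if the write fails, report that the caller should simply
 * exit on its own. */
static enum abort_action notify_abort(int pm_fd, int pm_supports_abort,
                                      int exit_code)
{
    char msg[64];
    int n;

    if (!pm_supports_abort)
        return ABORT_LOCAL_EXIT;

    n = format_abort_msg(msg, sizeof msg, exit_code);
    if (n < 0 || write(pm_fd, msg, (size_t)n) != (ssize_t)n)
        return ABORT_LOCAL_EXIT;   /* could not reach the PM: fall back */

    return ABORT_SENT_TO_PM;
}
```

A caller would invoke notify_abort(fd, flag, code) and then exit(code) in
either case; the only difference is whether the process manager hears about
the abort first and gets a chance to clean up the other ranks.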

  -- Pavan

On 03/25/2008 02:29 PM, Pete Wyckoff wrote:
> balaji at mcs.anl.gov wrote on Mon, 17 Mar 2008 16:58 -0500:
>> On 03/17/2008 04:24 PM, Grismer, Matthew J Civ USAF AFMC AFRL/RBAC wrote:
>>> I've compiled MPICH2 1.0.7rc1 under Mac OS X 10.4.11 using the Intel
>>> 10.1 C/C++ and Fortran compilers, and use mpiexec from OSC to run the
>>> executables. Everything runs fine, except when the code attempts to
>>> abort gracefully because of an error using MPI_Abort. The MPI process
>>> that calls the abort stops, but none of the other MPI processes stop.
>>> Any suggestions?
>> MPICH2 CH3:sock just hands over the abort call to the process manager using 
>> PMI_Abort, which performs the actual cleanup. I'm CC'ing Pete Wyckoff @ OSC 
>> about this.
>>
>> Pete -- I can reproduce this problem with mpiexec-v0.83 using a simple 
>> program that does:
>>
>> MPI_Init(&argc, &argv);
>> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> if (!rank) MPI_Abort(MPI_COMM_WORLD, 1);
>> MPI_Finalize();
>>
>> Also, this doesn't occur with MPD.
> 
> I complained about this to mpich2-maint back in July 2006.  You can
> read the thread in req #2626.  Bill said there that handling abort
> will be on the list of plans for the next gen PMI.
> 
> In the simple PMI, which both mpiexec and MPD use, PMI_Abort() just
> does exit().  What it should do is send a PMI message to the process
> starter (mpiexec), telling it that the process called abort.
> Something like in the PMI_Finalize handler directly above it in the
> mpich2 code.  A very small amount of code would be required here.
> 
> The way MPD appears to work is that it kills everything if any one
> process dies.  This is not always the correct behavior and has
> caused us problems in the past on the production machines.  In
> particular, consider the case where tasks > 0 exit cleanly but task
> #0 continues to write a big datafile.  With this behavior, it will
> be killed prematurely when the others exit.
> 
> Mpiexec kills everything if any process dies with a signal (like
> SEGV).  It doesn't kill everything if one exits with exit(0).  Nor
> does it kill everything if one exits with exit(1) or any other
> non-zero status.  We toyed with the idea of changing that last
> behavior, but found many Fortran and C programs that fall
> off the end of main() and return a random exit status.
> 
> Matt:  you can use "mpiexec -kill" to destroy the other processes on
> abort (and normal exit!) if that's what you want.
> 
> 		-- Pete
> 
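
For reference, the two teardown policies Pete describes can be written as
predicates on a waitpid() status. This is only a sketch of the behavior as
stated in the thread, not code from mpiexec or MPD; the function names and
the child_wait_status() helper are made up for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Policy attributed to mpiexec in this thread: tear the job down only when
 * a task was killed by a signal. Any exit() status, zero or non-zero, is
 * left alone. */
static int mpiexec_style_kill_all(int wait_status)
{
    return WIFSIGNALED(wait_status) ? 1 : 0;
}

/* Policy MPD appears to follow according to this thread: once any task
 * ends, for whatever reason, the rest of the job is killed. */
static int mpd_style_kill_all(int wait_status)
{
    return (WIFEXITED(wait_status) || WIFSIGNALED(wait_status)) ? 1 : 0;
}

/* Hypothetical helper for trying the policies: fork a child that either
 * exits with exit_code (when signo == 0) or kills itself with signo, and
 * return the raw wait status. Returns -1 if fork() or waitpid() fails. */
static int child_wait_status(int exit_code, int signo)
{
    int status = -1;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (signo != 0) {
            signal(signo, SIG_DFL);   /* make sure the default action applies */
            raise(signo);
        }
        _exit(exit_code);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

Under the first policy a task that calls exit(1) after an error leaves the
other ranks running, which matches the hang Matt reported; under the second,
the clean exit of tasks > 0 would also take down a task 0 that is still
writing its datafile.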

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

