WRF MPICH jobs don't terminate

Pete Wyckoff pw at osc.edu
Mon Oct 1 10:34:58 EDT 2007


Jan.Ploski at offis.de wrote on Mon, 01 Oct 2007 10:07 +0200:
> I have a problem running WRF jobs compiled with the MPICH from PGI 
> 6.2-5. Frequently, a job on a larger number of processors (say, 84) will 
> not terminate at all. Although the rsl.out.0000 log file contains the 
> 'SUCCESS COMPLETE WRF' line (which is the last output line produced by 
> wrf.exe), the job remains in state 'Running'. On the target node, I can 
> see mpiexec processes which don't consume any CPU:
> 
> jploski   5420  0.0  0.0   5656   916 ?        S    09:47   0:00 mpiexec -np 84 /data/nwp/output_files/wrf/jploski/katrina/case_2_95_84/wrf.exe
> jploski   5422  0.0  0.0   5656   476 ?        S    09:47   0:00 mpiexec -np 84 /data/nwp/output_files/wrf/jploski/katrina/case_2_95_84/wrf.exe
> 
> These hang-ups don't occur when I run jobs with WRF compiled with MVAPICH.
> 
> My question is: could it be an mpiexec problem, and if so, how could I 
> diagnose it in more detail (e.g., are there specific code locations to 
> instrument with debugging output)?

This sounds like it might be PBS losing OBIT messages.  We've seen
similar things in the past.  It would help to know which version of
PBS you are using.

You can use "mpiexec -v -v" to see all the messages that mpiexec
gets from PBS, and it will tell you what tasks it is waiting for.
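
A minimal sketch of how that output might be kept for later inspection,
assuming a POSIX shell job script.  The helper name run_verbose, the
MPIEXEC override and the log file name are made up for illustration;
only "-v -v" comes from the advice above.

```shell
# Hypothetical helper: run mpiexec doubly verbose and keep everything it prints.
# MPIEXEC may be overridden; it defaults to the mpiexec found in PATH.
run_verbose() {
    log=$1
    shift
    # Send stdout and stderr to the same log, since it is not obvious in
    # advance which stream carries the messages of interest.
    "${MPIEXEC:-mpiexec}" -v -v "$@" > "$log" 2>&1
}

# In the job script, for example:
#   run_verbose mpiexec-verbose.log -np 84 ./wrf.exe
```

After a hang, the end of that log should be the place to look for the
tasks mpiexec is still waiting on.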

Check all the nodes and make sure no wrf.exe is still running.  Also
check the PBS mom logs, comparing the nodes whose tasks mpiexec is
still waiting for against the others, to see if you can find a pattern
that shows something different is happening on those nodes.
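
The node check could be scripted roughly as follows.  This is only a
sketch: it assumes a file listing the job's nodes one per line (for
instance a saved copy of $PBS_NODEFILE), passwordless ssh to the
compute nodes, and pgrep being installed there.  The function names
and the REMOTE_SH override are hypothetical.

```shell
# Hypothetical helper: print each distinct host from a PBS-style node file
# (one host name per line, typically repeated once per processor slot).
unique_nodes() {
    sort -u "$1"
}

# Hypothetical helper: report wrf.exe processes still alive on each node.
# REMOTE_SH defaults to "ssh -n"; -n keeps ssh from reading the caller's stdin.
check_leftovers() {
    for node in $(unique_nodes "$1"); do
        echo "== $node"
        # pgrep -l prints PID and process name, and exits non-zero
        # when nothing matches.
        ${REMOTE_SH:-ssh -n} "$node" pgrep -l wrf.exe || echo "   no wrf.exe found"
    done
}

# Example, run from a node that can reach the compute nodes:
#   check_leftovers nodes.txt
```

Note that a failed ssh connection is also reported as "no wrf.exe
found" here, so an unreachable node deserves a second look by hand.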

		-- Pete

