WRF MPICH jobs don't terminate
Jan Ploski
Jan.Ploski at offis.de
Mon Oct 1 11:38:45 EDT 2007
Pete Wyckoff <pw at osc.edu> wrote on 10/01/2007 04:34:58 PM:
> Jan.Ploski at offis.de wrote on Mon, 01 Oct 2007 10:07 +0200:
> > I have a problem with running WRF jobs compiled using MPICH from PGI
> > 6.2-5. Frequently, a job with a bigger number of processors (say, 84) will
> > not terminate at all. Although the rsl.out.0000 log file contains the
> > 'SUCCESS COMPLETE WRF' line (which is the last output line produced by
> > wrf.exe), the job remains in state 'Running'. On the target node, I can
> > see mpiexec processes which don't consume any CPU:
> >
> > jploski 5420 0.0 0.0 5656 916 ? S 09:47 0:00 mpiexec -np 84 /data/nwp/output_files/wrf/jploski/katrina/case_2_95_84/wrf.exe
> > jploski 5422 0.0 0.0 5656 476 ? S 09:47 0:00 mpiexec -np 84 /data/nwp/output_files/wrf/jploski/katrina/case_2_95_84/wrf.exe
> >
> > These hang-ups don't occur when I run jobs with WRF compiled with MVAPICH.
> >
> > My question is: could it be an mpiexec problem, and if so, how could I
> > diagnose it in more detail (e.g. any specific code locations to instrument
> > with debugging output?)
>
> This sounds like it might be PBS losing OBIT messages. We've seen
> similar things in the past. Would be nice to know what version of
> PBS you are using.
I'm running TORQUE 2.1.6.
> You can use "mpiexec -v -v" to see all the messages that mpiexec
> gets from PBS, and it will tell you what tasks it is waiting for.
OK, I added these options to my scripts and submitted some jobs. Now it's
time to wait until they hang again...
> Check all the nodes and make sure no wrf is still running. Also
> check the PBS mom logs, comparing the ones mpiexec is waiting for
> against others, to see if you can find a pattern that shows something
> different is happening on those nodes.
Good tips, thank you.
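
For the node check, something along these lines might do. It is only a rough sketch under several assumptions: the job's node list is a text file with one hostname per line (as in the file TORQUE points to via $PBS_NODEFILE, which repeats a host once per processor slot), passwordless ssh to the compute nodes works, pgrep is installed there, and the executable is called wrf.exe as in the process listing above. The function names are made up for illustration.

```python
import subprocess


def unique_nodes(nodefile_text):
    """Return hostnames from a PBS-style node file, in order, without repeats."""
    seen = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in seen:
            seen.append(host)
    return seen


def check_command(host, exe="wrf.exe"):
    """Build an ssh command that lists processes matching `exe` on `host`.

    pgrep treats `exe` as a regular expression, so "wrf.exe" is a loose match.
    """
    return ["ssh", host, "pgrep", "-l", exe]


def nodes_still_running(hosts, exe="wrf.exe", run=subprocess.run):
    """Return the hosts on which pgrep reports at least one matching process."""
    busy = []
    for host in hosts:
        result = run(check_command(host, exe), capture_output=True, text=True)
        # pgrep exits with 0 only when a process matched; 1 means no match,
        # and an ssh failure (255) is also treated as "nothing found" here.
        if result.returncode == 0:
            busy.append(host)
    return busy
```

Feeding unique_nodes() the contents of the node file and passing the result to nodes_still_running() should give the nodes worth comparing against the tasks that mpiexec reports it is still waiting for. Note that an unreachable node is silently counted as idle in this sketch, so it may be worth checking ssh failures separately.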
Best regards,
Jan Ploski
--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
Betriebliches Informationsmanagement
Escherweg 2 - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: Jan.Ploski at offis.de - URL: http://www.offis.de