mpiexec jobs hang up into sleep state
Milos Milosavljevic
milos at astro.as.utexas.edu
Fri Jul 25 12:29:18 EDT 2008
Hi,
I am running mpiexec on Red Hat Enterprise Linux 5 with mpich2-64. The
system has four quad-core Xeon X7350 processors (16 cores in total). I
run the executable in the background as follows (the same hangup occurs
when the job is run in the foreground with no redirects):
mpiexec -n 16 executable < /dev/null > screen.out &
After several minutes or hours of uninterrupted execution with good
load balance, the job suddenly hangs up in the sleep state (the
status of all associated processes in top goes from 'R' to 'S').
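To record the state change outside of top, a minimal sketch along the following lines may help. It assumes a Linux /proc filesystem; the function names proc_state and report_states are hypothetical and not part of mpiexec or mpich2. In the state column, 'S' is an interruptible sleep and 'T' is a job-control stop, so the two cases can be told apart.

```shell
#!/bin/sh
# Print the one-letter scheduler state (R, S, D, T, Z, ...) of a PID,
# taken from the "State:" line of /proc/<pid>/status (Linux only).
proc_state() {
    [ -r "/proc/$1/status" ] || return 1
    while read -r key value rest; do
        if [ "$key" = "State:" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# Print "<pid> <state>" for every PID given as an argument.
report_states() {
    for pid in "$@"; do
        printf '%s %s\n' "$pid" "$(proc_state "$pid")"
    done
}
```

A possible use is `report_states $(pgrep -x executable)`, where "executable" stands for the actual program name, run periodically so that the moment of the transition from 'R' to 'S' is logged.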
I am able to revive the job simply by bringing it to the foreground
with 'fg', suspending it with Ctrl-Z, and resuming it in the background
with 'bg'. The job then continues for another few minutes or hours,
until it hangs up again.
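The manual revival can be approximated without a terminal, since Ctrl-Z delivers SIGTSTP to the foreground process group and 'bg' resumes a stopped job with SIGCONT. The sketch below sends the same two signals to a process group. The name nudge_group is hypothetical, the argument is assumed to be the process-group ID of the mpiexec job, and whether this actually revives the hung job in the same way is an assumption that would need to be checked.

```shell
#!/bin/sh
# Send SIGTSTP and then SIGCONT to every process in one process group,
# approximating what Ctrl-Z and 'bg' deliver from an interactive shell.
# Assumption: $1 is the process-group ID of the mpiexec job.
nudge_group() {
    pgid=$1
    kill -s TSTP -- "-$pgid" || return 1   # what Ctrl-Z sends to the group
    sleep 1                                # allow the stop to take effect
    kill -s CONT -- "-$pgid"               # what 'bg' sends to resume it
}
```

The process-group ID can be read with `ps -o pgid= -p <pid of mpiexec>`. Note that these signals reach only the processes in that group, just as the terminal-generated ones do.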
The output of the calculation is correct and unaffected by the hangup
and revival: the final results look accurate regardless of how many
times the job hangs up and is revived by hand. The hangup therefore
does not seem to be triggered from within the simulation. It is also
uncorrelated with the disk write sequence, etc.
Thank you in advance for any help with this problem,
Milos