mpiexec jobs hang up into sleep state
Pete Wyckoff
pw at osc.edu
Wed Jul 30 12:08:22 EDT 2008
milos at astro.as.utexas.edu wrote on Fri, 25 Jul 2008 11:29 -0500:
> I am running mpiexec on RH Enterprise 5 with mpich2-64. The system is a
> 4 x Quad Core X7350 Xeon. I run the executable in the background as
> follows (the same hangup occurs when the job is run in the foreground
> with no redirects):
>
> mpiexec -n 16 executable < /dev/null > screen.out &
>
> After several minutes or hours of uninterrupted execution with good load
> balance, the job suddenly hangs up in the sleep state (the status of all
> associated processes in top goes from 'R' to 'S').
>
> I am able to revive the job simply with 'fg ctrl-z bg'. Then the job
> continues for another few minutes or hours, until it hangs up again.
>
> The output of the calculation is correct and unaffected by the hangup
> and revival. The final results of the calculation look accurate
> regardless of how many times it hangs up and gets revived by hand, so
> the hangup does not seem to be triggered from within the simulation. It
> is also uncorrelated with the disk write sequence, etc.
That is interesting. I tried to reproduce this but could not; I'm
guessing it depends on how the executable tries to read from stdin.
My tests either run to completion, or exit immediately when they
find that stdin (/dev/null) is empty.
Does your shell print anything out, like:

    [1]+ Stopped mpiexec ...

That would indicate that the process got a SIGTSTP due to having
no controlling terminal.
If you can find any interesting output in "mpiexec -v -v -v" and
can correlate events in the log with the times when the executable
goes to sleep, that would be very helpful.
Also run "strace -p <pid>" on one of the tasks on one of the
compute nodes to figure out what they are waiting for. Maybe
look at task #0 and task #12 (or some other non-zero task).
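
To correlate the hang with a timestamp before attaching strace, something
like the following could log the state of each task over time. This is
only a rough sketch, assuming a Linux /proc filesystem; the function names
proc_state and log_states are made up for illustration and are not part of
mpiexec.

```shell
# Hypothetical helpers, assuming Linux /proc; not part of mpiexec.

# Print the state letter (R, S, D, T, ...) of one PID, read from /proc.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# Print "time pid state" for each PID given, one line per process.
log_states() {
    for pid in "$@"; do
        printf '%s %s %s\n' "$(date +%H:%M:%S)" "$pid" "$(proc_state "$pid")"
    done
}
```

For example, "while sleep 60; do log_states $(pgrep -x executable); done >> states.log"
would record one sample per minute (states.log is just a placeholder name).
A state of 'T' means the task was stopped by a signal, while 'S' means it
is sleeping in a system call, which is the case where strace should show
what it is blocked on.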
You also might experiment with "-nostdin" instead of redirecting
stdin from /dev/null. That should do the same thing, though.
-- Pete