Trouble with all jobs launched on same node.
Pete Wyckoff
pw at osc.edu
Thu Apr 8 13:01:10 EDT 2004
james.becnel at srs.gov said on Thu, 08 Apr 2004 10:58 -0400:
> Problem: For example, on a 4 node run, although OpenPBS allocates 4
> separate nodes, they all launch on the first node given in the PBS node
> list.
>
> Any ideas why? I have been exploring the code to try to find a problem,
> but it looks to be where mpiexec gets the information from OpenPBS. Let
> me know if you have any thoughts. Maybe I need to be looking at the code
> on the OpenPBS side instead? Can you suggest any workarounds? I have no
> problems coding in C. Thank you!
Very bizarre. The qstat says you have one processor on each of four
different nodes. But mpiexec for some reason wants to allocate
the same node/cpu four times. That should never be possible.
Could you do "qstat -f <jobid>" and show me the exec_host line and all
the Resource_List lines? Let's see what PBS is feeding to the
get_hosts() routine in mpiexec. You could step through the code in
get_hosts() to see what it's doing if nothing obviously wrong appears
in the PBS output, but I'm suspicious something odd is happening at that
interface.
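
If it helps to check the PBS side by hand first, here is a rough sketch (not code from mpiexec, and not what get_hosts() actually does) that pulls the exec_host value out of saved "qstat -f" output and flags any node/cpu slot listed more than once. It assumes the usual host/cpu+host/cpu layout of exec_host and that wrapped continuation lines begin with a tab. The function names and the sample job text are made up for illustration.

```python
from collections import Counter


def parse_exec_host(qstat_text):
    """Return a list of (host, cpu_index) pairs from 'qstat -f' text."""
    # Assumption: long attribute values are wrapped onto continuation
    # lines that start with a tab, so glue those back on first.
    joined = []
    for line in qstat_text.splitlines():
        if line.startswith("\t") and joined:
            joined[-1] += line.strip()
        else:
            joined.append(line.strip())

    for line in joined:
        name, sep, value = line.partition("=")
        if sep and name.strip() == "exec_host":
            slots = []
            # Assumed layout: node01/0+node02/0+...
            for entry in value.strip().split("+"):
                host, _, cpu = entry.partition("/")
                slots.append((host, int(cpu) if cpu else 0))
            return slots
    return []


def duplicate_slots(slots):
    """Return the (host, cpu_index) pairs that appear more than once."""
    counts = Counter(slots)
    return [slot for slot, n in counts.items() if n > 1]


# Hypothetical sample, standing in for real "qstat -f <jobid>" output.
SAMPLE = """Job Id: 123.head
    Job_Name = myjob
    exec_host = node01/0+node02/0+node03/0+node04/0
    Resource_List.nodes = 4
"""

if __name__ == "__main__":
    slots = parse_exec_host(SAMPLE)
    print("slots:", slots)
    print("duplicates:", duplicate_slots(slots))
```

If the duplicates list comes back empty for your real job, PBS is handing out four distinct slots and the trouble is more likely on the mpiexec side of the interface.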
> mpiexec-0.75 compile options: --with-comm=shared
> --with-default-comm=mpich-p4
The --with-comm option doesn't do anything to mpiexec. Maybe you meant to
feed that to your MPICH build. But it's not the problem now. Just for
giggles, do "mpiexec -version" to see how it thinks it was compiled.
And the entire output of "mpiexec -v -v myjob" would be handy too.
-- Pete