mpiexec and myrinet
Pete Wyckoff
pw at osc.edu
Wed Nov 13 09:41:16 EST 2002
tmerritt at email.arizona.edu said:
> I have mpich-gm 1.2.4..8a, and mpiexec 0.70 running with pbspro 5.2.0 on
> a 32 node 2 way cluster. It seems to run fine in most cases but I have
> a strange problem that I was hoping someone could shed some light on.
> When I run mm5 on 4 nodes, 2ppn via mpiexec, it runs fine. When I run
> it on 16 nodes, 1ppn, it runs fine. When I run it on 8 nodes, 2ppn, one
> of the processes terminates, inexplicably. If I run it with mpirun, it
> runs fine. I'm quite baffled, then environment for the 2 runs looks
> similar as far as the GMPI variables go. Running with mpiexec -verbose
> yields little help, here is the output though:
>
> read_gm_startup_ports: waiting for info
> read_gm_startup_ports: id 2 port 2 board 0 gm_node_id 34 pid 14555
> read_gm_startup_ports: id 4 port 2 board 0 gm_node_id 32 pid 6808
> read_gm_startup_ports: id 10 port 4 board 0 gm_node_id 34 pid 14560
> read_gm_startup_ports: id 12 port 4 board 0 gm_node_id 32 pid 6813
> read_gm_startup_ports: id 6 port 2 board 0 gm_node_id 28 pid 5622
> read_gm_startup_ports: id 3 port 2 board 0 gm_node_id 31 pid 2477
> read_gm_startup_ports: id 9 port 2 board 0 gm_node_id 30 pid 3174
> read_gm_startup_ports: id 14 port 4 board 0 gm_node_id 28 pid 5625
> read_gm_startup_ports: id 0 port 2 board 0 gm_node_id 29 pid 21571
> read_gm_startup_ports: id 11 port 4 board 0 gm_node_id 31 pid 2480
> read_gm_startup_ports: id 1 port 4 board 0 gm_node_id 30 pid 3171
> read_gm_startup_ports: id 5 port 2 board 0 gm_node_id 33 pid 979
> read_gm_startup_ports: id 7 port 2 board 0 gm_node_id 27 pid 2044
> read_gm_startup_ports: id 8 port 4 board 0 gm_node_id 29 pid 21572
> read_gm_startup_ports: id 13 port 4 board 0 gm_node_id 33 pid 984
> read_gm_startup_ports: id 15 port 4 board 0 gm_node_id 27 pid 2047
> wait_tasks: numspawned = 16, got evt 30 for tid 14 host node028 status
> 267
> wait_tasks: numspawned = 15, got evt 31 for tid 15 host node027 status
> 267
> wait_tasks: numspawned = 14, got evt 26 for tid 10 host node032 status
> 267
> wait_tasks: 8 stray obit 0 while waiting for kill 26
The read_gm_startup_ports output looks fine. Note though that the tids go
across the eight nodes first for 0..7, then across them again for 8..15.
I would have expected procs 0 and 1 to be on node 0, not procs 0 and 8.
Not sure if that matters or not. (PBS decides this layout by setting
the node list string you see in "qstat -an". It's usually something
like "node05/0+node05/1+..." with the procs in order here at least.)
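One way to see the layout at a glance is to group the ids in the verbose output by the gm_node_id they reported. The following is only a rough sketch: the function name and regular expression are made up for illustration, and it assumes the log lines keep exactly the format shown in the quoted output. The sample lines are copied from that output.

```python
import re
from collections import defaultdict

# Four lines copied from the quoted "mpiexec -verbose" output.
LOG = """\
read_gm_startup_ports: id 0 port 2 board 0 gm_node_id 29 pid 21571
read_gm_startup_ports: id 8 port 4 board 0 gm_node_id 29 pid 21572
read_gm_startup_ports: id 1 port 4 board 0 gm_node_id 30 pid 3171
read_gm_startup_ports: id 9 port 2 board 0 gm_node_id 30 pid 3174
"""

# Assumed line format; anything that does not match is skipped.
PATTERN = re.compile(
    r"read_gm_startup_ports: id (\d+) port \d+ board \d+ gm_node_id (\d+)"
)


def ids_by_gm_node(text):
    """Group task ids by the gm_node_id each one reported."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            groups[int(match.group(2))].append(int(match.group(1)))
    return {node: sorted(ids) for node, ids in groups.items()}


if __name__ == "__main__":
    for node, ids in sorted(ids_by_gm_node(LOG).items()):
        print(f"gm_node_id {node}: ids {ids}")
```

On the sample lines this pairs id 0 with id 8 and id 1 with id 9, which is the "across the nodes first" ordering described above.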
The wait_tasks lines report that PBS claims that tasks 14, 15, and 10 died
with SEGV. 267 == 0x10b, which is PBS's way of saying unnatural exit with
SIGSEGV.
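The arithmetic behind that status can be spelled out as below. This is a sketch under an assumption: that a status of 256 (0x100) or more means "256 plus the signal number". The helper name is hypothetical and is not part of PBS or mpiexec.

```python
def decode_pbs_status(status):
    """Split a task exit status into (killed_by_signal, value).

    Assumption: statuses of 256 or more encode 256 + signal number;
    anything below 256 is taken to be an ordinary exit code.
    """
    if status >= 256:
        return True, status - 256
    return False, status


if __name__ == "__main__":
    killed, value = decode_pbs_status(267)
    # 267 is 0x10b, i.e. 0x100 + 0x0b, and signal 11 is SIGSEGV on Linux.
    print(hex(267), killed, value)
```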
That last line about "stray obit" is completely bizarre. When mpiexec
notices that some task dies AND you have started it with "-kill", it
waits 5 seconds then sends SIGKILL to the rest of the tasks. There
might be a race condition between tasks dying off and us trying to kill
them, so this message normally means that happened (and is harmless).
But it appears that you did not use -kill since that obit event #0 is
invalid. In that case it's just reporting that PBS delivered event #0
again. It's not supposed to do that.
Anyway the messages are misleading and I'll make sure that zero is not
being used by PBS for an event number and make a patch.
But I don't think it will do anything for your problem. The SEGV from
some of the tasks is pretty conclusive evidence of an app or mpich or gm
problem. Weird that the code works under mpirun. Can you find which
task-to-node arrangement mpirun uses and make mpiexec use that to see if
that is indeed the issue? Use a "-config" file with perhaps:
node01 : mycode
node01 : mycode
node02 : mycode
node02 : mycode
...
to force a layout order no matter what PBS says. You'll have to figure
out the node names in the script, or run interactive, to get this to
work.
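If you'd rather generate that file than write it by hand, something like the following could do it. It is only a sketch: the node names and the program name "mycode" are placeholders, the function is made up for illustration, and you would still need to get the real node names from your job.

```python
def make_config(nodes, ppn, program):
    """Return config text with each node listed ppn times in a row.

    Consecutive tasks therefore land on the same node, in the
    "node : program" form shown above.
    """
    lines = [f"{node} : {program}" for node in nodes for _ in range(ppn)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical node names; substitute the ones PBS gave your job.
    nodes = [f"node{i:02d}" for i in range(1, 9)]
    print(make_config(nodes, 2, "mycode"), end="")
```

Save the output to a file and pass that file to mpiexec with "-config".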
More "-v" switches to mpiexec produce more verbosity, although I doubt
you'll see anything more useful.
-- Pete