unknown error
Pete Wyckoff
pw at osc.edu
Wed Jul 9 16:20:12 EDT 2003
stefan.friedel at iwr.uni-heidelberg.de said on Wed, 09 Jul 2003 11:38 +0200:
> some jobs are dying with
> ####################
> mpiexec: Warning: main: task 20 died with signal 9 (raw 0x109).
> mpiexec: Warning: main: task 28 died with signal 9 (raw 0x109).
> mpiexec: Warning: main: task 60 died with signal 7 (raw 0x107).
Two SIGKILLs and one SIGBUS. I modified mpiexec to print symbolic
names for well-known signals (or all, depending on architecture)
in future releases.
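
Until then, the raw values can be decoded by hand. What follows is only
a rough sketch, not mpiexec code: it assumes the raw value is 256 plus
the signal number, which is what the warnings above suggest (0x109 -> 9,
0x107 -> 7). The function name decode_raw_status is made up for this
example, and the symbolic names come from the platform Python runs on,
so they can differ between architectures.

```python
import signal

def decode_raw_status(raw):
    """Return (signal_number, name) for a raw status, or None.

    Assumption (inferred from the log, not from mpiexec itself):
    a raw value above 256 means "killed by signal raw - 256".
    """
    if raw <= 256:
        return None  # not a signal death under this assumption
    signum = raw - 256
    try:
        # Signal names are platform-specific.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown"
    return signum, name

# The two raw values from the warnings quoted above.
for raw in (0x109, 0x107):
    print(hex(raw), decode_raw_status(raw))
```

On the Linux nodes in question, signals 9 and 7 are SIGKILL and SIGBUS.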
[..]
> All 64 tasks started.
> read_gm_startup_ports: waiting for info
> read_gm_startup_ports: id 5 port 2 board 0 gm_node_id 251 pid 11938
> read_gm_startup_ports: id 4 port 2 board 0 gm_node_id 252 pid 11905
> [... everything ok up to now]
> read_gm_startup_ports: id 60 port 4 board 0 gm_node_id 226 pid 11745
> read_gm_startup_ports: id 25 port 4 board 0 gm_node_id 230 pid 11238
> wait_tasks: numspawned = 64, got evt 126 for tid 62 host node226 status 263
> wait_tasks: task 60 tid 62 stray obit 0 while waiting for kill 126
> wait_tasks: numspawned = 63, got evt 130 for tid 2 host node256 status 0
> wait_tasks: numspawned = 62, got evt 162 for tid 34 host node256 status 0
> [...]
Looks good. The first "wait_tasks" line is the process on node226 dying
with a SIGBUS (status 263 is the raw 0x107 from the warning for task 60).
Since mpiexec notices something is not right, it sends SIGKILL to the
rest and waits for them to die. However, some of them were already
dying off on their own, which leads to the "stray obit" message.
It's a harmless race condition.
Bottom line: your code is dying and mpiexec is just trying to clean up.
-- Pete