Mpiexec launching past 1536 nodes?

Maestas, Christopher Daniel cdmaest at sandia.gov
Mon Sep 19 08:36:20 EDT 2005


Hello,

We've been testing mpiexec to launch jobs of more than 1536 processes.
This is with code from CVS, since we are running on IB with a later
mvapich.
The same output occurs with .80 as well.
On startup I see:
---
read_ib_startup_ports: waiting for checkins
read_ib_startup_ports: version 3 startup
read_ib_startup_ports: rank 0 checked in, 2048 left
... #A bunch of the above line, decrementing the num of processes left
read_ib_startup_ports: rank 785 checked in, 56 left
read_ib_startup_ports: rando_child: poll got 1
do_child:  listeners are -1 -1 -1
do_child: output from process at 1026 type ERR, 0 select bits left
readsome: fd 1026 has 30 bytes
aggregate_output: output to stream ERR from NODE_IP: do_child: poll got 2
... # a bunch of the above 4 lines for each NODE_IP that started
---

We find that it hangs on startup with tons of messages:
---
mpiexec: process_obit_event: evt EVTID task TASKNAME on NODENAME stat 1
---

I thought it was related to the usleep used in ib.c,
but raising or lowering the value doesn't seem to help.

The PBS mom/server logs don't say much either.
Any clue on your end?  I have strace output, but it is huge! :-)

- Chris


