Mpiexec launching past 1536 nodes?

Maestas, Christopher Daniel cdmaest at sandia.gov
Mon Sep 19 23:39:09 EDT 2005


Thanks for the feedback Pete.
I tried 1024 nodes launching 2048 processes, and that works
fine with the -nostdin and -nostdout options.

I think it is a timeout with > 1536 pbs_mom processes.  Running with
-nostdin and -nostdout didn't help beyond this level.
Removing the usleep, or making it shorter or longer, made no
difference.

We are doing some further digging.  We are running the latest torque
(1.2.0p6) with "pbs_mom -p".
The problem is working out which pbs_mom is failing to connect, and
therefore which pbs_mom log to look at.  The failure seems to occur
consistently at a certain node count.

Another possibility we discussed is that reading the mpi binary from
nfs/panfs is taking a long time and timing out there.  We will also
look into the fast dist package and see if that helps out.


-----Original Message-----
From: Pete Wyckoff [mailto:pw at osc.edu] 
Sent: Monday, September 19, 2005 7:18 AM
To: Maestas, Christopher Daniel
Cc: mpiexec at osc.edu
Subject: Re: Mpiexec launching past 1536 nodes?

cdmaest at sandia.gov wrote on Mon, 19 Sep 2005 06:36 -0600:
> We've been testing mpiexec to launch jobs > 1536 processes.
> This is with code from cvs since we are running on ib and a later 
> mvapich.
> The same output occurs w/ .80 as well.
> On startup I see
> ---
> read_ib_startup_ports: waiting for checkins
> read_ib_startup_ports: version 3 startup
> read_ib_startup_ports: rank 0 checked in, 2048 left ... #A bunch of 
> the above line, decrementing the num of processes left
> read_ib_startup_ports: rank 785 checked in, 56 left
> read_ib_startup_ports: rando_child: poll got 1
> do_child:  listeners are -1 -1 -1
> do_child: output from process at 1026 type ERR, 0 select bits left
> readsome: fd 1026 has 30 bytes
> aggregate_output: output to stream ERR from NODE_IP: do_child: poll got 2
> ... # a bunch of the above 4 lines for each NODE_IP that started
> ---

This looks okay:  it's a bit mixed up as you have debug messages from
the IB startup going on at the same time the stdio listener is getting
output from already-connected processes.  Output on the stderr stream,
in particular, here from task 1026.

> We find that it hangs on startup with tons of messages:
> ---
> mpiexec: process_obit_event: evt EVTID task TASKNAME on NODENAME stat 1
> 
> I was thinking it was related to the usleep used in ib.c, but changing
> it higher or lower doesn't seem to help.

That is a normal message when a task exits, from the mpiexec point of
view.  It died on its own with exit(1) or similar.

> The pbs mom/server logs don't say much either.
> Any clue on your end?  I have an strace output, that is huge! :-)

The territory out beyond 1024 has not been well explored and I would
expect bugs somewhere in there.  You've seen the discussion in stdio.c
around line 274?  It tries to make sure there are enough slots in the
system per-process file descriptor table to accommodate all the
connections to the stdio listener.  You might try "-nostdin -nostdout"
to see if that is the source of the problems.

In ib.c, one socket will be held open for each process, so 2048 in your
case.  There's no test against _SC_OPEN_MAX there (but maybe there
should be).  I don't think the sleep matters there.  It was just put in
to avoid a busy spin.  You might take it out entirely and see how things
change, but they probably won't.
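
A minimal sketch of the kind of _SC_OPEN_MAX test mentioned above,
assuming one socket per process plus a handful of descriptors reserved
for stdio, logs and the listening socket.  The function names and the
"reserved" parameter are hypothetical and are not taken from ib.c or
stdio.c.

```c
#include <unistd.h>

/* Hypothetical helper, not part of mpiexec.
 * Returns 1 if max_open descriptors are enough for nproc sockets plus
 * `reserved` other descriptors, 0 if not, and -1 if max_open is
 * negative (limit unknown or indeterminate). */
int fds_sufficient(long max_open, long nproc, long reserved)
{
    if (max_open < 0)
        return -1;
    return (nproc + reserved <= max_open) ? 1 : 0;
}

/* Same check against the running process's own limit, as reported by
 * sysconf(_SC_OPEN_MAX).  sysconf returns -1 when the limit is
 * indeterminate, which is passed through as "unknown". */
int check_fd_headroom(long nproc, long reserved)
{
    return fds_sufficient(sysconf(_SC_OPEN_MAX), nproc, reserved);
}
```

With a descriptor limit of 1024, which is a common default,
check_fd_headroom(2048, 16) would return 0, so a 2048-process launch
would need the limit raised (e.g. with "ulimit -n") before mpiexec
starts.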

The listening IB socket is put that way via "listen(mport_fd, 1024)".
Could the backlog be too small and some clients are getting
ECONNREFUSED?  You might strace one of the other pbs_mom processes with
-vFf to go into the task as it is spawned and see if it fails to
connect.  There would likely be a message in the mom log, though.
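
One thing that can be checked on the backlog question, assuming the
mpiexec host runs Linux: per listen(2), a backlog larger than the value
in /proc/sys/net/core/somaxconn is silently truncated to that value, so
the 1024 requested may not be what the kernel uses.  The sketch below
reads that cap and computes the effective backlog.  Both function names
are hypothetical and nothing here comes from the mpiexec source.

```c
#include <stdio.h>

/* Read the kernel's cap on listen() backlogs (Linux only).
 * Returns -1 if the file cannot be opened or parsed. */
long read_somaxconn(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Backlog the kernel would actually use for listen(fd, requested)
 * given the cap.  An unknown cap (negative) leaves the request
 * unchanged. */
long effective_backlog(long requested, long somaxconn)
{
    if (somaxconn >= 0 && requested > somaxconn)
        return somaxconn;
    return requested;
}
```

Calling effective_backlog(1024, read_somaxconn()) on the node running
mpiexec shows whether the requested 1024 survives the cap.  If it does
not, raising net.core.somaxconn with sysctl is one thing to try.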

		-- Pete



