Frustrating problem

Pete Wyckoff pw at osc.edu
Fri Dec 17 09:34:39 EST 2004


bclem at rice.edu wrote on Fri, 17 Dec 2004 05:32 -0600:
> We are having an issue right now that occurs every time we run MPI jobs
> using mpiexec 0.77. We are using mpich-1.2.6 with the mpich-p4 comm
> 
> The first time we run the mpi program using mpiexec
> 
> We get the following errors (and it's random processes/nodes each first
> time)
> Process 26 of 50 on n96.rtc
> p26_1111:  p4_error: Timeout in establishing connection to remote process:
> 0
> Process 10 of 50 on n118.rtc
> p10_1393:  p4_error: Timeout in establishing connection to remote process:
> 0
> Process 42 of 50 on n75.rtc
> 
> During the same job session, if we run the exact same command it runs
> fine (i.e., we run the exact command again right after the first
> command).
> 
> It's very weird... and our users are starting to complain. At this point I
> don't know what's causing the problem. Our systems analyst keeps pointing
> to mpiexec (that's why I'm emailing the list).

I think the executables have all started up at this point.  They've even
managed to connect back to the mpiexec process to be able to deliver
these errors, so that phase is over.  These errors come from deep in
mpich/mpid/ch_p4/p4/lib/p4_sock_conn.c when one process tries to send a
message to another one for the first time, and tries to establish the
initial connection.

There are about six instances of the same error message, though,
surrounded by a maze of #ifdefs, so it is not clear which one you are
hitting.  Presumably you're using neither MPD nor the threaded
listener.  That leaves three spots in the code that print this, all
related to message passing.

You might try running with -p4dbg 70 -p4rdbg 70 to get more debugging
info.  That "0" at the end of the error line doesn't actually mean it's
trying to talk to process 0, for instance, but the extra debugging might
say who is the slow responder.  Other things to think about are network
congestion or errors.  Check netstat -i for errors on all nodes.
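
For the netstat check, something like the following could save reading
the tables by eye on every node.  It is only a rough sketch: it assumes
the Linux net-tools column layout (a header row starting with "Iface"
and columns named RX-ERR, TX-ERR and so on) and passwordless ssh to the
nodes.  The names interfaces_with_errors and check_node are made up for
this example.

```python
import subprocess

# Counter columns to inspect; names assume Linux net-tools "netstat -i" output.
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def interfaces_with_errors(netstat_output):
    """Return {iface: {column: count}} for every nonzero error counter."""
    header = None
    problems = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            # Remember the header so columns are looked up by name, not position.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            # Skip the banner line and any row that does not match the header.
            continue
        row = dict(zip(header, fields))
        bad = {}
        for col in ERROR_COLUMNS:
            value = row.get(col, "0")
            if value.isdigit() and int(value) > 0:
                bad[col] = int(value)
        if bad:
            problems[row["Iface"]] = bad
    return problems


def check_node(host):
    """Run "netstat -i" on one node over ssh (assumed to work without a password)."""
    result = subprocess.run(
        ["ssh", host, "netstat", "-i"],
        capture_output=True, text=True, check=True,
    )
    return interfaces_with_errors(result.stdout)
```

Calling check_node("n96.rtc") for each node named in the error output
would return an empty dict for a clean node, and the offending
interfaces and counters otherwise.  Nonzero counters only say that
something went wrong since boot, not when, so comparing the numbers
before and after a failing run is probably more telling.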

Frustrating if it doesn't happen regularly though.

		-- Pete


