Frustrating problem

Brent M. Clements bclem at rice.edu
Tue Dec 21 08:44:12 EST 2004


Pete, I think it may be an issue with either the Ethernet components or
the machines themselves. I have removed the machines that "time out" from
the queues and restarted the jobs. The jobs run perfectly now.
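
For reference, here is a minimal sketch of how the nodes reporting the
timeout could be tallied from saved job output. It assumes the output
looks like the errors quoted below, and that the number in the "pNN_"
prefix is the same rank printed in the "Process NN of MM on host" lines.
The function name timed_out_hosts is made up for illustration. Note that
the host reporting the timeout is not necessarily the one at fault; it
may only be the side that gave up waiting.

```python
import re
from collections import Counter

# Patterns taken from the output format quoted in this thread.
PROCESS_RE = re.compile(r"Process (\d+) of (\d+) on (\S+)")
TIMEOUT_RE = re.compile(
    r"p(\d+)_\d+:\s+p4_error: Timeout in establishing connection"
)


def timed_out_hosts(log_text):
    """Count timeout reports per host (hypothetical helper).

    Assumes the pNN_ prefix on a p4_error line is the rank shown in the
    matching "Process NN of MM on host" line.
    """
    rank_to_host = {}
    timed_out_ranks = []
    for line in log_text.splitlines():
        m = PROCESS_RE.search(line)
        if m:
            rank_to_host[int(m.group(1))] = m.group(3)
            continue
        m = TIMEOUT_RE.search(line)
        if m:
            timed_out_ranks.append(int(m.group(1)))
    # Ranks with no "Process ..." line are grouped under "unknown".
    return Counter(rank_to_host.get(r, "unknown") for r in timed_out_ranks)


if __name__ == "__main__":
    # Lines copied from the error output quoted in this thread.
    sample = """Process 26 of 50 on n96.rtc
p26_1111:  p4_error: Timeout in establishing connection to remote process:
0
Process 10 of 50 on n118.rtc
p10_1393:  p4_error: Timeout in establishing connection to remote process:
0
Process 42 of 50 on n75.rtc
"""
    for host, count in timed_out_hosts(sample).items():
        print(host, count)
```

Running this over the output of several failed first runs would show
whether the same hosts keep appearing, which is the pattern that would
point at particular machines rather than at mpiexec.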

Thanks for the insight; I would have next guessed a hardware/software
issue.

Have a great holiday season.

-Brent

Brent Clements
Linux Technology Specialist
Information Technology
Rice University

Linux at Rice news and information
available only at http://linuxsupport.rice.edu


On Fri, 17 Dec 2004, Pete Wyckoff wrote:

> bclem at rice.edu wrote on Fri, 17 Dec 2004 05:32 -0600:
> > We are having an issue right now that occurs every time we run MPI jobs
> > using mpiexec 0.77. We are using mpich-1.2.6 with the mpich-p4 comm
> >
> > The first time we run the MPI program using mpiexec, we get the
> > following errors (and it's random processes/nodes each first time):
> > Process 26 of 50 on n96.rtc
> > p26_1111:  p4_error: Timeout in establishing connection to remote process:
> > 0
> > Process 10 of 50 on n118.rtc
> > p10_1393:  p4_error: Timeout in establishing connection to remote process:
> > 0
> > Process 42 of 50 on n75.rtc
> >
> > During the same job session, if we run the exact same command it runs
> > fine (i.e., we run the exact command again right after the first
> > time).
> >
> > It's very weird... and our users are starting to complain. At this point I
> > don't know what's causing the problem. Our systems analyst keeps pointing
> > to mpiexec (that's why I'm emailing the list).
>
> I think the executables have all started up at this point.  They've even
> managed to connect back to the mpiexec process to be able to deliver
> these errors, so that phase is over.  These errors come from deep in
> mpich/mpid/ch_p4/p4/lib/p4_sock_conn.c when one process tries to send a
> message to another one for the first time, and tries to establish the
> initial connection.
>
> There are about six instances of the same error message, though,
> surrounded by a maze of #ifdefs, so it is not clear where the problem
> lies.  Presumably you're not using MPD nor are you using the threaded
> listener.  That leaves three spots in the code that say this, all
> related to message passing.
>
> You might try running with -p4dbg 70 -p4rdbg 70 to get more debugging
> info.  That "0" at the end of the error line doesn't actually mean it's
> trying to talk to process 0, for instance, but the extra debugging might
> say who is the slow responder.  Other things to think about are network
> congestion or errors.  Check netstat -i for errors on all nodes.
>
> Frustrating if it doesn't happen regularly though.
>
> 		-- Pete
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec
>
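
Pete's suggestion to check netstat -i for errors on all nodes could be
scripted roughly as follows. This is only a sketch: it assumes the Linux
net-tools column layout (a header row starting with "Iface" and columns
named RX-ERR, RX-DRP, RX-OVR, TX-ERR, TX-DRP, TX-OVR), and the function
names are made up for illustration. How the output is gathered from each
node (rsh, ssh, a batch job) is left out, since that depends on the
cluster.

```python
# Counter columns to inspect; names assume Linux net-tools `netstat -i`.
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def parse_netstat_i(text):
    """Return {iface: {column: count}} for the error columns.

    Columns are located by header name, so the optional "Met" column
    does not matter. Rows that do not line up with the header, or that
    hold non-numeric counters, are skipped.
    """
    header = None
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        try:
            counters = {c: int(row[c]) for c in ERROR_COLUMNS if c in row}
        except ValueError:
            continue
        result[row["Iface"]] = counters
    return result


def interfaces_with_errors(text):
    """Keep only interfaces with a nonzero counter, and only those counters."""
    flagged = {}
    for iface, counters in parse_netstat_i(text).items():
        nonzero = {c: n for c, n in counters.items() if n}
        if nonzero:
            flagged[iface] = nonzero
    return flagged


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = """Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR   TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0    1000      3      0      0    2000      0      0      0 BMRU
lo    16436   0      50      0      0      0      50      0      0      0 LRU
"""
    print(interfaces_with_errors(sample))
```

The counters are cumulative since boot, so comparing them before and
after a failing run is more telling than a single nonzero value.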


