Frustrating problem

Brent M. Clements bclem at rice.edu
Fri Dec 17 06:32:47 EST 2004


Hi Pete et al,

We are having an issue right now that occurs every time we run MPI jobs
using mpiexec 0.77. We are using mpich-1.2.6 with the mpich-p4 comm.

The first time we run the MPI program using mpiexec, we get the following
errors (on different, seemingly random processes/nodes each time):
Process 26 of 50 on n96.rtc
p26_1111:  p4_error: Timeout in establishing connection to remote process: 0
Process 10 of 50 on n118.rtc
p10_1393:  p4_error: Timeout in establishing connection to remote process: 0
Process 42 of 50 on n75.rtc

If we run the exact same command again in the same job session, immediately
after the first attempt, it runs fine.

It's very weird, and our users are starting to complain. At this point I
don't know what's causing the problem. Our systems analyst keeps pointing
to mpiexec (that's why I'm emailing the list).

Thanks,
Brent


