mpiexec svn and mvapich 0.97 failure to start jobs
Pete Wyckoff
pw at osc.edu
Wed Mar 15 16:55:22 EST 2006
jtang at tchpc.tcd.ie wrote on Wed, 15 Mar 2006 18:08 +0000:
> With the announcement of mvapich 0.97 in the last day or so, I decided
> that it might be nice to run/install the latest and greatest version of mvapich
> and mpiexec for our infiniband cluster.
>
> mpiexec svn checkout - 20060315
> mvapich 0.97
> our IB stack is the voltaire shipped ibhost-3.x
> (the current release 3.5.0 i think)
>
> I tried compiling up cpi.c with 0.97 of mvapich using the latest mpiexec
> version and I get this....
>
>
> ------------------------------------------------------------------------------
> 17:53:15 jtang at iitac128 ~/sandbox/mpiexec-tests $
> /usr/support/mpiexec-svn-20060315/bin/mpiexec -verbose ./cpi-97
> mpiexec: resolve_exe: using absolute exe "./cpi-97".
> connect: Connection refused
> mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
> mpiexec: All 1 task (spawn 0) started.
> mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
> mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
> mpiexec: kill_tasks: killing all tasks.
> mpiexec: Warning: task 0 exited with status 1.
> ------------------------------------------------------------------------------
This "connect: Connection refused" message comes from the MPI task,
not from mpiexec. I have seen this before, unfortunately. Looking
at mvapich-0.9.7, it is clear that they included a misguided patch
from NERSC that the Mellanox people picked up. I told the original
author and Mellanox that it was a bad idea, almost a year ago.
Looks like OSU incorporated it anyway, not just once, but
_three_times_ in quick succession. Look in
mpid/vapi/process/pmgr_client_mpirun_rsh.c for this comment.
/*
* Route stdout and stderr to mpiexec if applicable
[..]
You will find 103 lines of code that must be deleted. Then you will
find the same 103 lines again, and again. Delete all three
identical sections.
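To locate the three copies before editing by hand, something like the
following might help. It is only a sketch: the helper name
find_marker_lines is made up for illustration, and it assumes the marker
comment text appears verbatim at the top of each duplicated section. It
reports line numbers only; it does not delete anything, since the exact
extent of each 103-line section still has to be checked in the source.

```python
# Marker text taken from the comment quoted above; assumed to appear
# verbatim at the start of each duplicated section.
MARKER = "Route stdout and stderr to mpiexec if applicable"


def find_marker_lines(text, marker=MARKER):
    """Return the 1-based line numbers of every line containing marker.

    Hypothetical helper for illustration: it only reports where the
    comment occurs so the sections can be reviewed and removed by hand.
    """
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]
```

Reading mpid/vapi/process/pmgr_client_mpirun_rsh.c into a string and
passing it to find_marker_lines should, if the description above holds for
your copy of the source, return three line numbers.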
Hopefully the remaining protocol is the same as earlier versions. I
did not look at 0.9.6. Let me know if you still have problems, and
please complain to the mvapich authors about how wrong this is.
A FAQ entry (#10) on why this patch is bad has been here for a
while:
http://svn.osc.edu/repos/mpiexec/trunk/README
and now it applies to mvapich-0.9.7 as well as the Mellanox MPI
releases. The short answer is that PBSPro doesn't include the
now-ancient mpiexec patch, so one user tried to put that
functionality into MPI instead. Unfortunately, doing so breaks
OpenPBS and Torque installations badly.
-- Pete