mpiexec svn and mvapich 0.97 failure to start jobs

Jimmy Tang jtang at tchpc.tcd.ie
Wed Mar 15 18:37:05 EST 2006


Hi Pete,

Thanks for the pointer on that block of code. I removed all three
duplicate sections and mpiexec worked as expected. I shall send the
mvapich and voltaire people an email to try to have that patch fixed
(removed), or at least I can complain about it :)
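
For anyone else hitting this, the three copies can be located by
searching for the marker comment that Pete quotes below. What follows
is only a rough sketch: the helper names are made up for illustration
and are not part of mvapich or mpiexec. It reports where each copy of
the comment starts and does not delete anything, so the extent of each
section still has to be checked by hand.

```python
from pathlib import Path

# Marker text taken from the comment quoted in the reply below.
MARKER = "Route stdout and stderr to mpiexec if applicable"


def find_marker_lines(source_text, marker=MARKER):
    """Return the 1-based line numbers on which the marker text appears."""
    return [
        lineno
        for lineno, line in enumerate(source_text.splitlines(), start=1)
        if marker in line
    ]


def report(path):
    """Print where each copy of the marker comment starts in the given file."""
    # errors="replace" keeps the scan going if the file has odd bytes in it.
    text = Path(path).read_text(errors="replace")
    hits = find_marker_lines(text)
    print(f"{path}: {len(hits)} occurrence(s) at lines {hits}")
    return hits
```

Run from the top of the mvapich source tree, this would be called as
report("mpid/vapi/process/pmgr_client_mpirun_rsh.c"); on an unpatched
0.9.7 tree one would expect it to list three occurrences.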

Thanks again,
Jimmy.


On Wed, Mar 15, 2006 at 04:55:22PM -0500, Pete Wyckoff wrote:
> jtang at tchpc.tcd.ie wrote on Wed, 15 Mar 2006 18:08 +0000:
> > With the announcement of mvapich 0.97 in the last day or so, I decided
> > that it might be nice to run/install the latest and greatest version of mvapich
> > and mpiexec for our infiniband cluster.
> > 
> >     mpiexec svn checkout - 20060315
> >     mvapich 0.97
> >     our IB stack is the voltaire shipped ibhost-3.x
> >        (the current release is 3.5.0, I think)
> > 
> > I tried compiling up cpi.c with 0.97 of mvapich using the latest mpiexec
> > version and I get this....
> > 
> > 
> > ------------------------------------------------------------------------------
> > 17:53:15 jtang@iitac128 ~/sandbox/mpiexec-tests $ /usr/support/mpiexec-svn-20060315/bin/mpiexec  -verbose  ./cpi-97
> > mpiexec: resolve_exe: using absolute exe "./cpi-97".
> > connect: Connection refused
> > mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
> > mpiexec: All 1 task (spawn 0) started.
> > mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
> > mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
> > mpiexec: kill_tasks: killing all tasks.
> > mpiexec: Warning: task 0 exited with status 1.
> > ------------------------------------------------------------------------------
> 
> This "connect: Connection refused" message comes from the MPI task,
> not from mpiexec.  I have seen this before, unfortunately.  Looking
> at mvapich-0.9.7, it is clear that they included a misguided patch
> from NERSC that the Mellanox people picked up.  I told the original
> author and Mellanox that it was a bad idea, almost a year ago.
> 
> Looks like OSU incorporated it anyway, not just once, but
> _three_times_ in quick succession.  Look in
> mpid/vapi/process/pmgr_client_mpirun_rsh.c for this comment.
> 
>      /*
>       *  Route stdout and stderr to mpiexec if applicable
>      [..]
> 
> You will find 103 lines of code that must be deleted.  Then you will
> find the same 103 lines again, and again.  Delete all three
> identical sections.
> 
> Hopefully the remaining protocol is the same as earlier versions.  I
> did not look at 0.9.6.  Let me know if you still have problems, and
> please complain to the mvapich authors how wrong this is.
> 
> A FAQ entry (#10) on why this patch is bad has been here for a
> while:
> http://svn.osc.edu/repos/mpiexec/trunk/README
> and now it applies to mvapich-0.9.7 as well as the Mellanox MPI
> releases.  The short answer is that PBSPro doesn't include the
> now-ancient mpiexec patch, so one user tried to put that
> functionality into MPI instead.  Unfortunately, doing so breaks
> OpenPBS and Torque installations badly.
> 
> 		-- Pete
---end quoted text---

-- 
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/

