mpiexec svn and mvapich 0.97 failure to start jobs
Jimmy Tang
jtang at tchpc.tcd.ie
Wed Mar 15 18:37:05 EST 2006
Hi Pete,
Thanks for the pointer to that block of code. I removed all three
duplicate sections and mpiexec worked as expected. I shall send the
mvapich and Voltaire people an email to try to get that patch fixed
(removed), or at least I can complain about it :)
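
In case it helps anyone else hitting this, here is a rough sketch of how
the repeated sections can be located before deleting them by hand. It only
reports the line numbers where the marker comment appears; it does not
edit the file. The helper name find_marker_lines is purely illustrative
and is not part of mvapich or mpiexec, and the marker string is simply the
comment text Pete quoted.

```python
# Sketch only: list where the marker comment occurs so the duplicated
# sections can be inspected and removed manually.
MARKER = "Route stdout and stderr to mpiexec if applicable"


def find_marker_lines(lines, marker=MARKER):
    """Return the 1-based numbers of all lines that contain the marker."""
    return [n for n, line in enumerate(lines, start=1) if marker in line]


def report(path):
    """Print one 'path:line' entry per occurrence, then a total count."""
    with open(path, errors="replace") as source:
        hits = find_marker_lines(source)
    for n in hits:
        print(f"{path}:{n}")
    print(f"{len(hits)} occurrence(s) found")
    return hits
```

Calling report("mpid/vapi/process/pmgr_client_mpirun_rsh.c") from the top
of the mvapich source tree should list one line per copy of the section.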
Thanks again,
Jimmy.
On Wed, Mar 15, 2006 at 04:55:22PM -0500, Pete Wyckoff wrote:
> jtang at tchpc.tcd.ie wrote on Wed, 15 Mar 2006 18:08 +0000:
> > With the announcement of mvapich 0.97 in the last day or so, I decided
> > that it might be nice to run/install the latest and greatest version of mvapich
> > and mpiexec for our infiniband cluster.
> >
> > mpiexec svn checkout - 20060315
> > mvapich 0.97
> > our IB stack is the voltaire shipped ibhost-3.x
> > (the current release is 3.5.0, I think)
> >
> > I tried compiling up cpi.c with 0.97 of mvapich using the latest mpiexec
> > version and I get this....
> >
> >
> > ------------------------------------------------------------------------------
> > 17:53:15 jtang at iitac128 ~/sandbox/mpiexec-tests $ /usr/support/mpiexec-svn-20060315/bin/mpiexec -verbose ./cpi-97
> > mpiexec: resolve_exe: using absolute exe "./cpi-97".
> > connect: Connection refused
> > mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
> > mpiexec: All 1 task (spawn 0) started.
> > mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
> > mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
> > mpiexec: kill_tasks: killing all tasks.
> > mpiexec: Warning: task 0 exited with status 1.
> > ------------------------------------------------------------------------------
>
> This "connect: Connection refused" message comes from the MPI task,
> not from mpiexec. I have seen this before, unfortunately. Looking
> at mvapich-0.9.7, it is clear that they included a misguided patch
> from NERSC that the Mellanox people picked up. I told the original
> author and Mellanox that it was a bad idea, almost a year ago.
>
> Looks like OSU incorporated it anyway, not just once, but
> _three_times_ in quick succession. Look in
> mpid/vapi/process/pmgr_client_mpirun_rsh.c for this comment.
>
> /*
> * Route stdout and stderr to mpiexec if applicable
> [..]
>
> You will find 103 lines of code that must be deleted. Then you will
> find the same 103 lines again, and again. Delete all three
> identical sections.
>
> Hopefully the remaining protocol is the same as earlier versions. I
> did not look at 0.9.6. Let me know if you still have problems, and
> please complain to the mvapich authors about how wrong this is.
>
> A FAQ entry (#10) on why this patch is bad has been here for a
> while:
> http://svn.osc.edu/repos/mpiexec/trunk/README
> and now it applies to mvapich-0.9.7 as well as the Mellanox MPI
> releases. The short answer is that PBSPro doesn't include the
> now-ancient mpiexec patch, so one user tried to put that
> functionality into MPI instead. Unfortunately, doing so breaks
> OpenPBS and Torque installations badly.
>
> -- Pete
---end quoted text---
--
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/