mpiexec svn and mvapich 0.97 failure to start jobs

Jimmy Tang jtang at tchpc.tcd.ie
Fri Jul 7 10:38:34 EDT 2006


Hi Pete,



On Wed, Mar 15, 2006 at 04:55:22PM -0500, Pete Wyckoff wrote:
> 
> This "connect: Connection refused" message comes from the MPI task,
> not from mpiexec.  I have seen this before, unfortunately.  Looking
> at mvapich-0.9.7, it is clear that they included a misguided patch
> from NERSC that the Mellanox people picked up.  I told the original
> author and Mellanox that it was a bad idea, almost a year ago.
> 
> Looks like OSU incorporated it anyway, not just once, but
> _three_times_ in quick succession.  Look in
> mpid/vapi/process/pmgr_client_mpirun_rsh.c for this comment.
> 
>      /*
>       *  Route stdout and stderr to mpiexec if applicable
>      [..]
> 
> You will find 103 lines of code that must be deleted.  Then you will
> find the same 103 lines again, and again.  Delete all three
> identical sections.
> 
> Hopefully the remaining protocol is the same as earlier versions.  I
> did not look at 0.9.6.  Let me know if you still have problems, and
> please complain to the mvapich authors how wrong this is.
> 
> A FAQ entry (#10) on why this patch is bad has been here for a
> while:
> http://svn.osc.edu/repos/mpiexec/trunk/README
> and now it applies to mvapich-0.9.7 as well as the Mellanox MPI
> releases.  The short answer is that PBSPro doesn't include the
> now-ancient mpiexec patch, so one user tried to put that
> functionality into MPI instead.  Unfortunately, doing so breaks
> OpenPBS and Torque installations badly.
> 
> 		-- Pete
---end quoted text---

I've just been playing with mvapich 0.9.8-rc0 and the trunk version
(checked out yesterday). I checked the
mpid/vapi/process/pmgr_client_mpirun_rsh.c file and diffed it against a
patched version from 0.9.7:


---some stuff----
15:16:53 login02 /scratch/mvapich/src/mvapich-0.9.8-rc0 $ diff /usr/support/src/mvapich-0.9.7/mpid/vapi/process/pmgr_client_mpirun_rsh.c  mpid/vapi/process/pmgr_client_mpirun_rsh.c
42,44d41
< static int mpirun_stdin_port;
< static int mpirun_stdout_port;
< static int mpirun_stderr_port;
48,50d44
< static int mpirun_stdin_socket;
< static int mpirun_stdout_socket;
< static int mpirun_stderr_socket;
60d53
<     struct sockaddr_in sockaddr;
73c66,67
<         herror("gethostbyname");
---
>         fprintf(stderr, "gethostbyname failed:: %s: %s (%d)\n",
>                 mpirun_hostname, hstrerror(h_errno), h_errno);
170c164
<     return 1;
---
>    return 1;
200c194,195
<         herror("gethostbyname");
---
>         fprintf(stderr, "gethostbyname failed:: %s: %s (%d)\n",
>                 mpirun_hostname, hstrerror(h_errno), h_errno);
405c400,402
<     if (!he)
---
>     if (!he) {
>         fprintf(stderr,"gethostbyname failed: %s: %s (%d)\n",
>                 mpirun_hostname, hstrerror(h_errno), h_errno);
406a404
----end some stuff----
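
Apart from the dropped stdin/stdout/stderr port and socket variables, the
visible changes in that diff are in how a failed gethostbyname() is reported:
herror() on the 0.9.7 side, fprintf() with hstrerror() on the 0.9.8-rc0 side.
A minimal sketch of the newer style follows, for anyone who wants to see the
difference in isolation. The function names format_lookup_error and
lookup_host are made up for illustration and are not part of mvapich.

```c
#define _DEFAULT_SOURCE   /* glibc: expose hstrerror() and h_errno codes */
#include <netdb.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not mvapich code): build the message in the style of
 * the 0.9.8-rc0 side of the diff, naming the host and the numeric code.
 * The older herror("gethostbyname") call printed only the error text. */
int format_lookup_error(char *buf, size_t len, const char *hostname, int err)
{
    return snprintf(buf, len, "gethostbyname failed: %s: %s (%d)",
                    hostname, hstrerror(err), err);
}

/* Hypothetical wrapper: resolve a host name and report any failure on
 * stderr. Returns the hostent on success, NULL on failure. */
struct hostent *lookup_host(const char *hostname)
{
    struct hostent *he = gethostbyname(hostname);
    if (!he) {
        char msg[256];
        format_lookup_error(msg, sizeof msg, hostname, h_errno);
        fprintf(stderr, "%s\n", msg);
    }
    return he;
}
```

Calling lookup_host("some-node") from a small main() is enough to try it;
the point is only that the message now says which host failed to resolve,
which helps when a job spans many nodes.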

The mvapich devs have pretty much removed the offending code from 0.9.8-rc0
and later. I've compiled and tested both 0.9.8-rc0 and trunk, and mpiexec
0.80 works as expected with them. I'd expect 0.81 to work as well, but I
haven't tested it.

Anyway, this was just to bump the thread with some updated information.




Jimmy.

-- 
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/

