Odd behavior with mpiexec for voltaire infiniband package

Christopher D. Maestas cdmaest at sandia.gov
Wed Feb 4 12:45:05 EST 2004


Hello,

In start_tasks.c, you detail that the following should occur in the
startup for infiniband tasks:
---
/*
 * Each IB process connects to our socket, then does three writes:
 *     int version
 *     int mpi_rank (from MPIRUN_RANK)
 *     int addrlen
 *     u8  address[addrlen]  (IB particulars)
 * Then each expects to read back np * address[addrlen] corresponding
 * to the addresses of all of the processes, including itself.
 * After this exchange, still sit around waiting for one more
 * operation, a barrier, after all the QPs are up.
 *
 * Never actually close the listening socket, as that is where a process
 * will call when it needs to cause an MPI_Abort later.
 */
---
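
For reference, the layout that comment describes can be sketched as a small parser. This is only an illustration under stated assumptions, not the code in start_tasks.c: the struct ib_startup, the function parse_ib_startup, and the 256-byte address limit are hypothetical names and values, and the ints are assumed to be in host byte order. It parses from a memory buffer rather than a socket so it can be tried in isolation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for one process's startup record. */
struct ib_startup {
    int version;
    int rank;
    int addrlen;
    uint8_t address[256];   /* assumed upper bound, not taken from mpiexec */
};

/*
 * Parse one record laid out as described in the comment:
 *     int version, int mpi_rank, int addrlen, u8 address[addrlen]
 * Returns the number of bytes consumed, or -1 if the buffer is too
 * short or addrlen is negative or larger than the assumed bound.
 */
static long parse_ib_startup(const uint8_t *buf, size_t len,
                             struct ib_startup *out)
{
    int hdr[3];
    size_t off;

    assert(buf != NULL && out != NULL);

    if (len < sizeof hdr)
        return -1;
    memcpy(hdr, buf, sizeof hdr);
    off = sizeof hdr;

    if (hdr[2] < 0 || (size_t)hdr[2] > sizeof out->address)
        return -1;
    if (len - off < (size_t)hdr[2])
        return -1;

    out->version = hdr[0];
    out->rank    = hdr[1];
    out->addrlen = hdr[2];
    memcpy(out->address, buf + off, (size_t)hdr[2]);
    return (long)(off + (size_t)hdr[2]);
}
```

A sender that leaves out any one of the three leading ints shifts every later field by sizeof(int), which is the kind of mismatch reported below.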


The infiniband environment distributed with Voltaire's software
integration package (ibhost-hpc-2.0.0_10-1rh90.k) errors out when run
here:
---
[cdmaest at ca894 mpi]$ /projects/mpiexec/bin/mpiexec -np 2 -pernode -comm=ib cpi_infiniband
mpiexec: Warning: read_ib_startup_ports: protocol version 0 not known, but might still work.
mpiexec: Error: read_ib_startup_ports: rank 48 out of bounds [0..2).
read: Connection reset by peer
---

If I comment out the first read_full (the one that reads version) in
start_tasks.c, things work with the Voltaire package:

--- start_tasks.c.orig  Wed Feb  4 09:40:52 2004
+++ start_tasks.c       Wed Feb  4 10:24:15 2004
@@ -1116,7 +1116,7 @@
        /*
         * Read the entire address info for one process.
         */
-       read_full(fd, &version, sizeof(int));
+       /* read_full(fd, &version, sizeof(int)); */
        if (version != 1)
            warning("%s: protocol version %d not known, but might still work",
              __func__, version);


With that change applied, the run completes:
---
mpiexec: Warning: read_ib_startup_ports: protocol version 134548964 not known, but might still work.
mpiexec: Warning: read_ib_startup_ports: protocol version 134548964 not known, but might still work.
Process 1 on ca893
Process 0 on ca894
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.000131
---

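It looks like the Voltaire processes do not send the version field at
all. In the unpatched run, the value read as version is really the
rank (0), and the value read as rank is really addrlen (presumably 48,
hence "rank 48 out of bounds"). With the read commented out, the
fields line up again; the odd version number in the warning is
presumably just because version is never set.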
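
The shift can be reproduced in a few lines. This is a sketch of the suspected mismatch, not Voltaire's or mpiexec's actual code: write_without_version and read_expecting_version are hypothetical helpers, and the address length of 48 is chosen only because it would match the "rank 48" error above, not because it is confirmed.

```c
#include <assert.h>
#include <string.h>

struct header3 {
    int version;
    int rank;
    int addrlen;
};

/*
 * Hypothetical sender: writes rank, addrlen and the address bytes,
 * with NO leading version field. Returns the number of bytes written.
 */
static size_t write_without_version(unsigned char *dst, int rank,
                                    int addrlen,
                                    const unsigned char *address)
{
    size_t off = 0;

    assert(dst != NULL && address != NULL && addrlen >= 0);

    memcpy(dst + off, &rank, sizeof rank);
    off += sizeof rank;
    memcpy(dst + off, &addrlen, sizeof addrlen);
    off += sizeof addrlen;
    memcpy(dst + off, address, (size_t)addrlen);
    off += (size_t)addrlen;
    return off;
}

/*
 * Reader that assumes the documented order (version, rank, addrlen).
 * Fed the stream above, each field picks up its neighbour's value.
 */
static struct header3 read_expecting_version(const unsigned char *src)
{
    struct header3 h;

    memcpy(&h.version, src, sizeof(int));                   /* really rank    */
    memcpy(&h.rank, src + sizeof(int), sizeof(int));        /* really addrlen */
    memcpy(&h.addrlen, src + 2 * sizeof(int), sizeof(int)); /* address bytes  */
    return h;
}
```

Writing rank 0 with a 48-byte address and reading it back this way yields version 0 and rank 48, the same pair of values as in the warning and error from the unpatched run.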

This workaround is fine for us, but I wanted to let you know that the
infiniband startup exchange appears to differ between the
distributions out there.

-- Chris



