mpiexec mvapich rank error
Pete Wyckoff
pw at osc.edu
Wed Jan 25 16:58:51 EST 2006
Alex.Ninaber at clustervision.com wrote on Wed, 25 Jan 2006 18:21 +0100:
> I have the following error from IB & mpiexec:
>
> Error: read_ib_startup_ports: barrier expecting rank 0, got 10.
>
> - mpiexec svn, checkout 1/25/2006
> - mvapich gen2
> - torque 2
>
> Please see below the output from Torque, any ideas why it gets the wrong
> rank (rank 10 with a 4 processor job)?
[..]
> mpiexec: read_ib_one: version 2 startup.
[..]
> mpiexec: Error: read_ib_startup_ports: barrier expecting rank 0, got 10.
This is the OSU mpich on openib (InfiniBand) implementation, for
those without the scorecard. They started with mpich on vapi, an
earlier IB interface, then ported to openib more recently.
Unfortunately these two versions of the code do not share source, so
fixes in the vapi version do not always show up in the openib version.
I pointed out this bug a while back:
http://openib.org/pipermail/openib-general/2005-September/010853.html
The relevant patch on the vapi patches page is #112:
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich/patch-0.9.5-112
It fixes a segv in mpirun, bumps the version number to fix part of
the problem you see, and helps scalability by sending binary pids
instead of 10-digit ascii. This is how mvapich (non-gen2) works now.
The patch targets the vapi tree, but it may apply with a bit of fuzz
to your openib tree. Attached
is the current diff of my modified openib SVN tree. (Actually I haven't
tested the mpirun_rsh.c part of it because I never use that. :) )
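To illustrate the "binary pids instead of 10-digit ascii" point, here is a minimal sketch in C of the two encodings. It is not the mvapich or mpiexec source: the function names are hypothetical, and the 32-bit width and network byte order are assumptions made only for the example, not the actual wire format. It also shows why both sides must agree on the startup version: a reader expecting one layout and handed the other will pull nonsense values out of the stream.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl, ntohl */

#define ASCII_PID_LEN 10 /* 4294967295 (max uint32) has exactly 10 digits */

/* Hypothetical: old-style encoding, pid as 10 zero-padded ASCII digits,
 * written without a trailing NUL. Returns the number of bytes written. */
static size_t encode_pid_ascii(uint32_t pid, char *buf)
{
    char tmp[ASCII_PID_LEN + 1];

    assert(buf != NULL);
    snprintf(tmp, sizeof tmp, "%010u", (unsigned) pid);
    memcpy(buf, tmp, ASCII_PID_LEN);
    return ASCII_PID_LEN;
}

/* Hypothetical: new-style encoding, pid as 4 raw bytes. Network byte
 * order is an assumption for this sketch. Returns bytes written. */
static size_t encode_pid_binary(uint32_t pid, unsigned char *buf)
{
    uint32_t net = htonl(pid);

    assert(buf != NULL);
    memcpy(buf, &net, sizeof net);
    return sizeof net;
}

/* Inverse of encode_pid_binary. */
static uint32_t decode_pid_binary(const unsigned char *buf)
{
    uint32_t net;

    assert(buf != NULL);
    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}
```

Under these assumptions each pid shrinks from 10 bytes to 4, which adds up when the launcher has to collect and redistribute one pid per rank for a large job.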
The same bug exists in the mvapich-gen2-1.0-106.tar.gz download from
OSU as well as in the openib tree. If you complain to the openib
mailing list, that may help push them to fix the bugs in their
mvapich-gen2. I don't know which of these sources is more
up-to-date, although the openib one is missing lots of components
like MPE that you may or may not care about.
-- Pete