mpiexec and tm fundamentals

Eygene Ryabinkin rea+maui at grid.kiae.ru
Fri Oct 5 03:23:07 EDT 2007


Joshua, good day.

Thu, Oct 04, 2007 at 11:33:39AM -0700, Joshua Bernstein wrote:
> So does this mean that mpiexec has to implement a copy of the startup interface 
> for each MPI spec it supports?

Yes.  The following interfaces are supported by mpiexec 0.82 (taken
from the output of a bare 'mpiexec'):
-----
  -comm (gm|mx|p4|ib|rai|pmi|lam|shmem|emp|none) : choose MPI (default mpich-ib)
  -mpich-p4-[no-]shmem : for MPICH/P4, specify if the library was
			 compiled with shared memory support (default yes)
-----

> Is there a way to figure out what startup 
> interface my MPICH is using? Is this the same thing as the MPICH comm? I'm 
> building MPICH as follows:
> 
>     ./configure --with-arch=LINUX \
>                 --prefix=/usr \
>                 --enable-debug \
>                 --with-comm=bproc  \
>                 --with-romio=--with-mpi=mpich   			 
> --with-romio=--with-file-system=nfs+ufs  --with-romio=--enable-aio=no 
> --disable-devdebug -lpthread" --with-device=%p4 \

Seems like you're using MPICH 1.x.

> So you can see I am using the comm "bproc" interface. I am indeed running on a
> Scyld/Bproc based cluster.

Yes, but it seems to me that 'bproc' is the task distribution
and startup interface that mimics the regular UNIX way of doing it.
If you want to use Torque, then Torque will spawn jobs for you, so
bproc will not be in the loop.

So, 'bproc' is not the MPI startup interface flavour.

I am not _very_ familiar with MPICH, but it seems that MPICH
uses the 'p4' interface for Ethernet-based fabrics and 'gm' or 'mx'
for Myrinet-based fabrics.  Consult mpiexec's page at
	http://www.osc.edu/~pw/mpiexec/index.php#Description
for the list of supported MPI implementations.  Try 'p4' first and
then try the other interfaces.
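
The "try 'p4' first, then the others" approach could be sketched as a
small shell helper.  This is only an illustration: the function name
try_comm and the MPIEXEC variable are hypothetical, and it assumes that
mpiexec exits with a non-zero status when the chosen interface does not
match the MPI library.  A mismatch may also simply hang, so use a short
test program and a walltime limit.

```shell
#!/bin/sh
# Hypothetical helper: try mpiexec '-comm' values in the given order and
# print the first one for which the program exits with status 0.
# MPIEXEC may be overridden; it defaults to the 'mpiexec' found in PATH.
MPIEXEC=${MPIEXEC:-mpiexec}

try_comm() {
    prog=$1
    shift
    for comm in "$@"; do
        # Discard the program's output; only the exit status matters here.
        if "$MPIEXEC" -comm "$comm" "$prog" >/dev/null 2>&1; then
            echo "$comm"
            return 0
        fi
    done
    # None of the candidate values worked.
    return 1
}
```

Inside a job script it would be called as, for example,
'try_comm ./mpijob p4 gm mx'.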

> Fair enough. I guess the real problem I'm having is how to handle running MPI 
> based jobs linked against a MPICH library built as I described above, under 
> TORQUE.
> 
> I know that when MPICH is using the bproc comm, MPICH makes a 
> bproc_exec(<node_number>) call to move the job to the remote node. After the 
> jobs get to the node and fork, I'm not sure what startup method they use after
> that.

As I explained above, Torque will do the spawns for you if you're
using Torque to launch your job.

> The problem is this: in my job script, I try to start an MPI job in the same 
> way I would outside TORQUE:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> exec ./mpijob
> ---
> 
> This of course correctly starts the jobs on the nodes, but if I do a qdel, to 
> kill the job, the job leaves the TORQUE queue, but the processes still stay on
> the nodes. This behavior has led me to use mpiexec.
> 
> So, if I use mpiexec a la:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> mpiexec -comm none ./mpijob
> ---

'-comm none' just spawns a bunch of processes that are not
connected to each other.  Try the other interfaces, as I suggested
above.

> The jobs again, start properly on the nodes (albeit a bit slower), and then 
> when I do a qdel, the processes get properly cleaned off the nodes. The trouble 
> here is that the job still shows up in the TORQUE queue marked as running. The 
> only way to clean up this job is to remove its entries from 
> $PBS_HOME/server_priv/job.
> 
> Also what might be of interest is if I simply do not set BEOWULF_JOB_MAP a la:
> 
> ---
> #PBS -j oe
> mpiexec -comm none ./mpijob
> ----
> 
> I end up with several single threaded MPI jobs. So for example if I do 
> node=2:ppn=2, I end up with 4 separate jobs.

I am not familiar with BEOWULF_JOB_MAP, but it seems to me that
it is used by MPICH on Scyld/Beowulf clusters to do job
scheduling by hand.  Torque will schedule jobs for you, so it is
not needed.  Likely, when you specify BEOWULF_JOB_MAP, MPICH
tries to mangle the jobs in some way and this confuses Torque.

So, get rid of BEOWULF_JOB_MAP and try to find the mpiexec '-comm'
option value that suits your needs.  Start with 'p4' and if it doesn't
work, try other values.  For the first runs you may also want to omit
qsub's "-j oe" option to keep the setup clean for your experiments.
Though, I haven't tried "-j oe" under mpiexec myself and don't
know how it behaves: maybe there is no point in omitting it.
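
Put together, a job script along these lines might look as follows.
It is a sketch under assumptions: the resource request mirrors the
2-node, 2-processes-per-node case quoted above, 'p4' is only the first
value to try, and the count_slots helper is hypothetical, added just to
log how many process slots Torque listed in $PBS_NODEFILE.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2
# No BEOWULF_JOB_MAP and no "-j oe" here: Torque allocates the nodes and
# mpiexec asks Torque to spawn the processes on them.

# Hypothetical helper: count the non-empty lines of a node file
# (Torque writes one line per allocated process slot).
count_slots() {
    grep -c . "$1"
}

# Run only inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "slots allocated: $(count_slots "$PBS_NODEFILE")"
    cd "$PBS_O_WORKDIR" || exit 1
    mpiexec -comm p4 ./mpijob
fi
```

If the slot count printed in the job output matches the request but the
ranks still do not see each other, the '-comm' value is the next thing
to change.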
-- 
Eygene Ryabinkin, RRC KI

