mpiexec starts one process less than expected on dual nodes.
Pete Wyckoff
pw at osc.edu
Fri Apr 4 10:34:54 EST 2003
Roy.Dragseth at cc.uit.no said on Fri, 04 Apr 2003 15:08 +0200:
> The first node allocated gets one running process less than the others when I
> submit the job with -lnodes=2:ppn=2. The process list shows that the correct
> number of processes is started, but one process sleeps through the whole run.
This is a symptom of a p4-shmem mismatch between mpich and mpiexec.
> Configuration:
> OpenPBS 2.3.16
> mpich 1.2.5 p4 (not shared)
>
> mpiexec was configured with this line:
> ./configure --disable-p4-shmem --with-pbs=/opt/OpenPBS
> --with-default-comm=mpich-p4
I ran through the four possible combinations of compile options for
these with mpich-1.2.4 p4 just now to remind myself of what is supposed
to happen:
1. mpich:   configure --with-comm=shared
   mpiexec: configure
   Works, process tree looks like:
       pbs_mom
        \_ hello
            \_ hello   <- proc #0
            \_ hello   <- proc #1

2. mpich:   configure
   mpiexec: configure --disable-p4-shmem
   Works, process tree looks like:
       pbs_mom
        \_ hello
        |   \_ hello   <- proc #0
        \_ hello
            \_ hello   <- proc #1

3. mpich:   configure
   mpiexec: configure
   Fails, error message "local slave on uniprocessor without shared
   memory"

4. mpich:   configure --with-comm=shared
   mpiexec: configure --disable-p4-shmem
   Fails, timeout (hang) in MPI_Init.
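
The four combinations above can be summarized as a small lookup. This is
only an illustrative sketch in Python, not part of mpiexec or mpich. The
function name and its two boolean parameters are invented for this example,
and the returned strings paraphrase the results listed above.

```python
def expected_outcome(mpich_shared: bool, mpiexec_p4_shmem: bool) -> str:
    """Return the result observed above for one build combination.

    mpich_shared:     mpich was configured with --with-comm=shared
    mpiexec_p4_shmem: mpiexec was configured WITHOUT --disable-p4-shmem
    Both parameter names are hypothetical, chosen for illustration only.
    """
    if mpich_shared and mpiexec_p4_shmem:
        # Case 1: both sides agree on shared memory.
        return "works: one listener per node forks all workers"
    if not mpich_shared and not mpiexec_p4_shmem:
        # Case 2: both sides agree on no shared memory.
        return "works: one listener per worker"
    if not mpich_shared and mpiexec_p4_shmem:
        # Case 3: mpiexec assumes shared memory, mpich lacks it.
        return "fails: local slave on uniprocessor without shared memory"
    # Case 4: mpich assumes shared memory, mpiexec does not.
    return "fails: timeout (hang) in MPI_Init"
```

The point of the table is that only the two matching combinations work; the
two mismatches fail in different ways, which helps tell them apart.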
In mpich-p4, the process that is spawned first from pbs_mom becomes a
listener. There is one of these for each shared memory group, that is,
one per compute node. On a two-processor node, this listener will fork
two more processes which become the actual workers. When you do not
enable p4 shared memory, pbs_mom spawns two processes, each of which
becomes a listener and spawns one worker task.
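
To make the listener/worker counts concrete, here is a rough Python model of
what gets started on a single node. It is a sketch of the description above,
not code from mpich or mpiexec. The names spawn_plan and total_processes are
made up for the example, and ppn stands for the processes-per-node value
from the PBS request.

```python
from typing import List


def spawn_plan(ppn: int, p4_shmem: bool) -> List[int]:
    """Return, per listener started by pbs_mom, how many workers it forks."""
    if ppn < 1:
        raise ValueError("ppn must be at least 1")
    if p4_shmem:
        # One listener for the whole node; it forks every worker itself.
        return [ppn]
    # No shared memory: one listener per worker, each forking a single task.
    return [1] * ppn


def total_processes(plan: List[int]) -> int:
    """Listeners plus workers, i.e. the job's entries in ps on that node."""
    return len(plan) + sum(plan)
```

For ppn=2 this gives [2] (three processes in total) with shared memory and
[1, 1] (four processes in total) without, which matches the process trees in
cases 1 and 2.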
As seen in your ps outputs (thanks for sending that), mpiexec is trying
to spawn the jobs as if it thinks there is no shared memory available,
i.e. case (2) or (4) above. But I suspect your mpich-p4 library thinks
it was compiled with shared memory and is hanging in MPI_Init, which
would be case (4). The listener process should not consume much CPU time
if everything is working properly; it just sets up communication and
gets out of the way.
You might try playing with the included test program "hello.c". It
watches for hanging MPI_Init and will complain at you if this is indeed
the case. And take another look at your mpich compile and make sure you
used the correct paths when compiling your test program.
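
The general idea of watching for a hung initialization call can be sketched
as follows. This is not how hello.c is implemented (its mechanism is not
shown here); it is a generic Python illustration in which the function names
are invented and fake_hanging_init is a stand-in for a call such as MPI_Init
that never returns.

```python
import threading
import time


def call_with_watchdog(init, timeout_s: float) -> bool:
    """Run init() in a helper thread; return True if it finished in time."""
    done = threading.Event()

    def runner():
        init()
        done.set()

    # Daemon thread, so a stuck init() cannot keep the interpreter alive.
    threading.Thread(target=runner, daemon=True).start()
    # Event.wait returns False if the timeout expires before done is set.
    return done.wait(timeout_s)


def fake_hanging_init():
    """Hypothetical stand-in for an init call that blocks for too long."""
    time.sleep(1.0)
```

A caller would print a warning when call_with_watchdog(...) returns False,
for example to suggest checking for a p4-shmem mismatch.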
-- Pete