mpiexec starts one process less than expected on dual nodes.
Roy Dragseth
Roy.Dragseth at cc.uit.no
Fri Apr 4 08:08:06 EST 2003
Hi.
I've just downloaded mpiexec 0.73 and want to use it on our cluster, which is
based on dual Athlon nodes, and I discovered a strange thing:
The first node allocated gets one running process fewer than the others when I
submit the job with -lnodes=2:ppn=2. The process list shows that the correct
number of processes is started, but one process sleeps through the whole run.
The job statistics also show that cputime is 3 times higher than walltime; it
should be 4 times higher.
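The expected factor of 4 comes straight from the resource request: if every process spins on its own CPU, total cputime should be roughly nodes x ppn times the walltime. A minimal Python sketch of that arithmetic (the function name is only illustrative; it is not part of PBS or mpiexec):

```python
def expected_cpu_to_wall_ratio(nodes: int, ppn: int) -> int:
    """Ideal cputime/walltime ratio when every process spins on its own CPU."""
    return nodes * ppn


# Request used in the failing run: -lnodes=2:ppn=2
requested = expected_cpu_to_wall_ratio(nodes=2, ppn=2)

# Ratio reported by the job statistics for that run.
observed = 3

# Each fully busy process contributes 1 to the ratio, so the
# difference is the number of processes that did no work.
idle_processes = requested - observed

print(f"expected ratio {requested}, observed {observed}, "
      f"idle processes: {idle_processes}")
```

The shortfall of one matches the single sleeping process seen in the process list.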
If I do -lnodes=3:ppn=1, i.e. running one process per node, everything seems
to be fine.
Configuration:
OpenPBS 2.3.16
mpich 1.2.5 p4 (not shared)
mpiexec was configured with this line:
./configure --disable-p4-shmem --with-pbs=/opt/OpenPBS \
    --with-default-comm=mpich-p4
Attached is the process list on the two nodes, compute-0-2 and compute-0-1.
If you look at the list for compute-0-2, you will see that pid 22214 isn't
consuming any cputime in this case. mpi_test_clean is a trivial program
containing just mpi_init and mpi_finalize with a spinning loop in between.
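For larger jobs, the idle process can be picked out of the `ps -fHu` listing mechanically. The following Python sketch is one way to do it, under two assumptions that are mine rather than anything mpiexec documents: the child process under each top-level mpi_test_clean.x process is a helper that is expected to idle, and the command column is shortened to the executable name (the arguments in the attached listing are truncated, so they are left out here). The function name is hypothetical.

```python
# Rows taken from the compute-0-2 listing; columns are
# UID PID PPID C STIME TTY TIME CMD, with CMD shortened to the executable.
PS_ROWS = """\
royd 22216 963 90 12:45 ? 00:03:00 ./mpi_test_clean.x
royd 22217 22216 0 12:45 ? 00:00:00 ./mpi_test_clean.x
royd 22214 963 0 12:45 ? 00:00:00 ./mpi_test_clean.x
royd 22215 22214 0 12:45 ? 00:00:00 ./mpi_test_clean.x
"""


def idle_top_level_pids(ps_text, exe="./mpi_test_clean.x"):
    """Return PIDs of top-level `exe` processes that used no CPU time."""
    rows = []
    for line in ps_text.splitlines():
        # Split into at most 8 fields so the command keeps its arguments.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].startswith(exe):
            rows.append((int(fields[1]), int(fields[2]), fields[6]))

    mpi_pids = {pid for pid, _, _ in rows}
    # A process is "top level" if its parent is not another `exe` process.
    return [pid for pid, ppid, cpu_time in rows
            if ppid not in mpi_pids and cpu_time == "00:00:00"]


print(idle_top_level_pids(PS_ROWS))
```

Run against the compute-0-1 rows in the same way, the function should return an empty list, since both top-level processes there have accumulated cputime.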
Also, here is the output from mpiexec -verbose for this case:
resolve_exe: using absolute exe "./mpi_test_clean.x"
node 0: name = compute-0-2, mpname = compute-0-2, cpu = 1
node 1: name = compute-0-2, mpname = compute-0-2, cpu = 0
node 2: name = compute-0-1, mpname = compute-0-1, cpu = 1
node 3: name = compute-0-1, mpname = compute-0-1, cpu = 0
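As a cross-check that the allocation itself is correct, the verbose lines above can be tallied per host. This is a small Python sketch that assumes the line format is exactly as printed above; the regular expression and function name are my own and not part of mpiexec.

```python
import re
from collections import Counter

VERBOSE_OUTPUT = """\
node 0: name = compute-0-2, mpname = compute-0-2, cpu = 1
node 1: name = compute-0-2, mpname = compute-0-2, cpu = 0
node 2: name = compute-0-1, mpname = compute-0-1, cpu = 1
node 3: name = compute-0-1, mpname = compute-0-1, cpu = 0
"""

# Matches one "node N: name = HOST, mpname = HOST, cpu = K" line.
NODE_LINE = re.compile(
    r"node (\d+): name = ([^,\s]+), mpname = ([^,\s]+), cpu = (\d+)"
)


def tasks_per_host(text):
    """Count how many task slots mpiexec reports for each host name."""
    counts = Counter()
    for line in text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


counts = tasks_per_host(VERBOSE_OUTPUT)
for host, n in counts.items():
    print(f"{host}: {n} tasks")
```

Both hosts are assigned two tasks here, so the node list handed to mpiexec looks right and the problem appears to be in what one of the started processes does afterwards.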
Any suggestions are greatly appreciated, as mpiexec seems to fix all our
problems with dangling processes on our cluster.
Best regards,
Roy.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no
-------------- next part --------------
[royd at compute-0-2 royd]$ ps -fHu royd
UID PID PPID C STIME TTY TIME CMD
royd 22285 22283 1 12:48 pts/1 00:00:00 -bash
royd 22322 22285 0 12:49 pts/1 00:00:00 ps -fHu royd
royd 22216 963 90 12:45 ? 00:03:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd 22217 22216 0 12:45 ? 00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd 22214 963 0 12:45 ? 00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd 22215 22214 0 12:45 ? 00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd 22176 963 0 12:45 ? 00:00:00 -bash
royd 22181 22176 0 12:45 ? 00:00:00 pbs_demux
royd 22209 22176 0 12:45 ? 00:00:00 /bin/bash -x /opt/OpenPBS/mom_priv/jobs/164.zuma.SC
royd 22211 22209 0 12:45 ? 00:00:00 /home/royd/src/mpiexec-0.73/mpiexec -verbose -nostdout ./mpi_test_clean.x
royd 22213 22211 0 12:45 ? 00:00:00 /home/royd/src/mpiexec-0.73/mpiexec -verbose -nostdout ./mpi_test_clean.x
[royd at compute-0-1 royd]$ ps -fHu royd
UID PID PPID C STIME TTY TIME CMD
royd 2712 2710 0 12:48 pts/0 00:00:00 -bash
royd 2749 2712 0 12:48 pts/0 00:00:00 ps -fHu royd
royd 2683 947 95 12:45 ? 00:02:41 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd 2685 2683 0 12:45 ? 00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd 2678 947 91 12:45 ? 00:02:42 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd 2684 2678 0 12:45 ? 00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu