mpiexec starts one process less than expected on dual nodes.

Roy Dragseth Roy.Dragseth at cc.uit.no
Fri Apr 4 08:08:06 EST 2003


Hi.

I've just downloaded mpiexec 0.73 and want to use it on our cluster of 
dual Athlon nodes, and I discovered a strange thing:

The first node allocated gets one fewer running process than the other when I 
submit the job with -lnodes=2:ppn=2.  The process list shows that the correct 
number of processes is started, but one process sleeps through the whole run.

The job statistics also show that cputime is 3 times walltime; with 4 
processes spinning for the whole run it should be 4 times walltime.
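
As a quick sanity check on that ratio: with nodes=2:ppn=2 there are 4 ranks, 
and if each one spins for the whole run, cputime/walltime should be about 4. 
A minimal Python sketch of the arithmetic follows.  The function names and 
the sample times are made up for illustration; they are not PBS output.

```python
def expected_ratio(nodes, ppn):
    """cputime/walltime expected when every rank is busy for the whole run."""
    return nodes * ppn


def busy_ranks(cputime_s, walltime_s):
    """Estimate how many ranks were actually consuming CPU."""
    return round(cputime_s / walltime_s)


# Hypothetical figures: 180 s of walltime, 540 s of accumulated cputime.
nodes, ppn = 2, 2
walltime_s, cputime_s = 180.0, 540.0

expected = expected_ratio(nodes, ppn)
observed = busy_ranks(cputime_s, walltime_s)
print(f"expected {expected} busy ranks, observed about {observed}")
if observed < expected:
    print(f"{expected - observed} rank(s) appear to be idle")
```

A ratio of 3 where 4 is expected is consistent with exactly one rank doing 
no work.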

If I do -lnodes=3:ppn=1, i.e. running one process per node, everything seems 
to be fine.

Configuration:
OpenPBS 2.3.16
mpich 1.2.5 p4 (not shared)

mpiexec was configured with this line:
./configure --disable-p4-shmem --with-pbs=/opt/OpenPBS --with-default-comm=mpich-p4

Attached are the process lists for the two nodes, compute-0-2 and compute-0-1. 
In the list for compute-0-2 you can see that pid 22214 isn't consuming any 
cputime.  mpi_test_clean is a trivial program containing just mpi_init and 
mpi_finalize with a spinning loop in between.
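
The same check can be scripted against the ps output.  Below is a rough 
Python sketch, not part of mpiexec or PBS: it assumes the indented child 
entries are helper processes forked by each rank that are expected to stay 
idle, so it only flags top-level entries with zero TIME.  The sample lines 
are taken from the attached compute-0-2 listing with the command lines 
shortened, and the function names are hypothetical.

```python
PS_LINES = """\
royd     22216   963 90 12:45 ?        00:03:00 ./mpi_test_clean.x
royd     22217 22216  0 12:45 ?        00:00:00   ./mpi_test_clean.x
royd     22214   963  0 12:45 ?        00:00:00 ./mpi_test_clean.x
royd     22215 22214  0 12:45 ?        00:00:00   ./mpi_test_clean.x
"""


def cpu_seconds(hms):
    """Convert a ps TIME field of the form HH:MM:SS to seconds.

    Simplification: longer-running jobs may show a days prefix, which
    this sketch does not handle.
    """
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def idle_ranks(ps_text, exe):
    """Return PIDs of top-level `exe` processes with zero CPU time."""
    rows = []
    for line in ps_text.splitlines():
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) == 8 and fields[7].startswith(exe):
            rows.append((int(fields[1]), int(fields[2]), cpu_seconds(fields[6])))
    pids = {pid for pid, _, _ in rows}
    # A row whose parent is also in the list is a forked child, not a rank.
    return [pid for pid, ppid, secs in rows if ppid not in pids and secs == 0]


print(idle_ranks(PS_LINES, "./mpi_test_clean.x"))
```

Run on the compute-0-1 listing instead, the same function should return an 
empty list, since both top-level processes there have accumulated CPU time.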

Also, here is the output from mpiexec -verbose for this case:

resolve_exe: using absolute exe "./mpi_test_clean.x"
node  0: name = compute-0-2, mpname = compute-0-2, cpu = 1
node  1: name = compute-0-2, mpname = compute-0-2, cpu = 0
node  2: name = compute-0-1, mpname = compute-0-1, cpu = 1
node  3: name = compute-0-1, mpname = compute-0-1, cpu = 0


Any suggestions are greatly appreciated, as mpiexec seems to fix all our 
problems with dangling processes on our cluster.

Best regards,
Roy.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no
-------------- next part --------------
[royd@compute-0-2 royd]$ ps -fHu royd
UID        PID  PPID  C STIME TTY          TIME CMD
royd     22285 22283  1 12:48 pts/1    00:00:00 -bash
royd     22322 22285  0 12:49 pts/1    00:00:00   ps -fHu royd
royd     22216   963 90 12:45 ?        00:03:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd     22217 22216  0 12:45 ?        00:00:00   ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd     22214   963  0 12:45 ?        00:00:00 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd     22215 22214  0 12:45 ?        00:00:00   ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd     22176   963  0 12:45 ?        00:00:00 -bash
royd     22181 22176  0 12:45 ?        00:00:00   pbs_demux
royd     22209 22176  0 12:45 ?        00:00:00   /bin/bash -x /opt/OpenPBS/mom_priv/jobs/164.zuma.SC
royd     22211 22209  0 12:45 ?        00:00:00     /home/royd/src/mpiexec-0.73/mpiexec -verbose -nostdout ./mpi_test_clean.x
royd     22213 22211  0 12:45 ?        00:00:00       /home/royd/src/mpiexec-0.73/mpiexec -verbose -nostdout ./mpi_test_clean.x


[royd@compute-0-1 royd]$ ps -fHu royd
UID        PID  PPID  C STIME TTY          TIME CMD
royd      2712  2710  0 12:48 pts/0    00:00:00 -bash
royd      2749  2712  0 12:48 pts/0    00:00:00   ps -fHu royd
royd      2683   947 95 12:45 ?        00:02:41 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd      2685  2683  0 12:45 ?        00:00:00   ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu
royd      2678   947 91 12:45 ?        00:02:42 ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compute
royd      2684  2678  0 12:45 ?        00:00:00   ./mpi_test_clean.x -p4wd /home/royd/cprog -execer_id mpiexec -master_host compute-0-2 -my_hostname compu

