MPIEXEC Problems

James O'Dell James_ODell at Brown.edu
Mon Jul 7 15:41:57 EDT 2003


Perhaps someone can help me.

I am using MPICH-1.2.5 compiled without shared memory.
I built MPIEXEC using:
./configure -with-pbs=/opt/OpenPBS/ --prefix=/opt/mpiexec --disable-p4-shmem

Everything builds correctly.

I have a small test file test.sh:

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=5:00
#PBS -l cput=40:00
#PBS -j oe
#PBS -o test.out
sed -e "s/$/.gig.net/" $PBS_NODEFILE > nodes
PBS_NODEFILE=nodes
mpiexec --nostdin -nostdout --verbose --comm=p4  hello
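
The sed line in the script appends the .gig.net suffix to every hostname
in the PBS node file. A minimal sketch of that step in isolation follows;
the helper name append_domain and the sample input are assumptions made
for illustration only, not part of PBS or mpiexec:

```shell
#!/bin/sh
# append_domain is a hypothetical helper, not part of PBS or mpiexec.
# It appends a domain suffix to every line read from standard input,
# the same transformation the sed line in test.sh applies to $PBS_NODEFILE.
append_domain() {
    # $1 is the suffix to append, e.g. ".gig.net"
    sed -e "s/\$/$1/"
}

# Made-up input mimicking a node file with ppn=2 (each host listed twice).
printf 'compute-0-24\ncompute-0-24\ncompute-0-25\n' | append_domain .gig.net
```

Running this on a copy of the node file is a quick way to confirm that the
rewritten names are the ones expected before handing them to anything else.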

When I run it, test.out contains:

resolve_exe: found exe "hello" in path
node  0: name = compute-0-24, mpname = compute-0-24, cpu = 1
node  1: name = compute-0-24, mpname = compute-0-24, cpu = 0
node  2: name = compute-0-25, mpname = compute-0-25, cpu = 1
node  3: name = compute-0-25, mpname = compute-0-25, cpu = 0
node  4: name = compute-1-14, mpname = compute-1-14, cpu = 1
node  5: name = compute-1-14, mpname = compute-1-14, cpu = 0
node  6: name = compute-1-15, mpname = compute-1-15, cpu = 1
node  7: name = compute-1-15, mpname = compute-1-15, cpu = 0
mpiexec: Error: wait_one_task_start: tm_poll remote: System error.
wait_one_task_start: evt = 2, task 0 host compute-0-24
read_p4_master_port: waiting for port from master
read_p4_master_port: got port 50988

The mom_logs directory on the lead node contains the following:

07/07/2003 15:32:58;0008;   pbs_mom;Job;2015.lou.cascv.brown.edu;Started, pid = 8297
07/07/2003 15:32:58;0100;   pbs_mom;Req;;Type 19 request received from PBS_Server at frontend-1-16, sock=10
07/07/2003 15:32:58;0008;   pbs_mom;Job;2015.lou.cascv.brown.edu;task started, /bin/sh
07/07/2003 15:32:58;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;task 1 terminated
07/07/2003 15:32:58;0008;   pbs_mom;Job;2015.lou.cascv.brown.edu;Terminated
07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:08;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;task 2 terminated
07/07/2003 15:33:10;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:10;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 2015.lou.cascv.brown.edu task 1
07/07/2003 15:33:10;0008;   pbs_mom;Job;2015.lou.cascv.brown.edu;kill_job
07/07/2003 15:33:10;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;Obit sent
07/07/2003 15:33:10;0100;   pbs_mom;Req;;Type 54 request received from PBS_Server at frontend-1-16, sock=11
07/07/2003 15:33:10;0100;   pbs_mom;Req;;Type 6 request received from PBS_Server at frontend-1-16, sock=11
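
Each log record above is a semicolon-separated line whose second field is a
numeric event code. As a rough sketch for summarizing a longer log, the
records could be tallied by that code. The helper name tally_mom_events is
hypothetical, and the sketch assumes only the record layout visible in the
excerpt above; lines that do not start with a date (for example, wrapped
continuation lines) are skipped:

```shell
#!/bin/sh
# tally_mom_events is a hypothetical helper, not a PBS tool.
# It reads pbs_mom log text on standard input and prints one line per
# event code (the second ';'-separated field) with the number of records.
tally_mom_events() {
    awk -F';' '
        # Only count lines that begin with an MM/DD/YYYY date stamp.
        /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] / { count[$2]++ }
        END { for (code in count) print code, count[code] }
    ' | sort
}

# Small made-up sample in the same layout as the excerpt above.
printf '%s\n' \
    '07/07/2003 15:32:58;0008;   pbs_mom;Job;example;Started' \
    '07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check' \
    '07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check' \
    | tally_mom_events
```

In practice the function would be fed the MOM log file itself on standard
input rather than the inline sample.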


Anybody know what I'm doing wrong?

Thanks,
Jim



