MPIEXEC Problems
Pete Wyckoff
pw at osc.edu
Mon Jul 7 17:06:26 EDT 2003
James_ODell at brown.edu said on Mon, 07 Jul 2003 15:41 -0400:
> Perhaps someone can help me.
>
> I am using MPICH-1.2.5 compiled without shared memory.
> I built MPIEXEC using:
> ./configure -with-pbs=/opt/OpenPBS/ --prefix=/opt/mpiexec
> --disable-p4-shmem
>
> Everything builds correctly.
Looks good.
> I have a small test file test.sh:
>
> #!/bin/sh
> #PBS -l nodes=4:ppn=2
> #PBS -l walltime=5:00
> #PBS -l cput=40:00
> #PBS -j oe
> #PBS -o test.out
> sed -e "s/$/.gig.net/" $PBS_NODEFILE > nodes
> PBS_NODEFILE=nodes
> mpiexec --nostdin -nostdout --verbose --comm=p4 hello
Note that mpiexec ignores PBS_NODEFILE. This isn't the cause of the
problem below, but if you want to use a different interface, you might
try:
mpiexec -transform-hostname='s/$/.gig.net/' ...
and get what you're looking for above.
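To see what that sed expression does to a host name, here is a small
sketch that applies it to two of the node names from the output below.
It uses only sed; the function name transform_hosts is made up for
illustration and is not part of mpiexec.

```shell
# Append the domain to every host name read from stdin, using the same
# sed expression as in the mpiexec option above.
transform_hosts() {
    sed -e 's/$/.gig.net/'
}

# Sample node names taken from the job output quoted below.
printf '%s\n' compute-0-24 compute-0-25 | transform_hosts
```

Each name comes back with .gig.net appended, which is what the sed line
in the job script was producing in the "nodes" file.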
> When I run it test.out contains:
>
> resolve_exe: found exe "hello" in path
> node 0: name = compute-0-24, mpname = compute-0-24, cpu = 1
> node 1: name = compute-0-24, mpname = compute-0-24, cpu = 0
> node 2: name = compute-0-25, mpname = compute-0-25, cpu = 1
> node 3: name = compute-0-25, mpname = compute-0-25, cpu = 0
> node 4: name = compute-1-14, mpname = compute-1-14, cpu = 1
> node 5: name = compute-1-14, mpname = compute-1-14, cpu = 0
> node 6: name = compute-1-15, mpname = compute-1-15, cpu = 1
> node 7: name = compute-1-15, mpname = compute-1-15, cpu = 0
> mpiexec: Error: wait_one_task_start: tm_poll remote: System error.
> wait_one_task_start: evt = 2, task 0 host compute-0-24
> read_p4_master_port: waiting for port from master
> read_p4_master_port: got port 50988
The way p4 startup works is that first just one process (the master) is
spawned; mpiexec then reads a port number from it and feeds that to the
later processes so they can find the master.
First it waits for TM, i.e. the pbs_mom, to say the job has started.
That is fine since we see the wait_one_task_start happy message. Then
it goes to read the port from the master process over a socket. That
worked fine too.
Then mpiexec starts up the rest of the tasks, but while waiting to see
if they ran okay, it was presented with the terrible "System error"
message from the pbs_mom. The error message appears out of order in your
output file because the PBS demuxer isn't too careful about preserving
order between stdout and stderr. You can run mpiexec inside an
interactive job to see the messages in order if you want to make sure.
> The mom_logs directory on the lead node contains the following:
>
> 07/07/2003 15:32:58;0008;
> pbs_mom;Job;2015.lou.cascv.brown.edu;Started, pid = 8297
> 07/07/2003 15:32:58;0100; pbs_mom;Req;;Type 19 request received from
> PBS_Server at frontend-1-16, sock=10
> 07/07/2003 15:32:58;0008; pbs_mom;Job;2015.lou.cascv.brown.edu;task
> started, /bin/sh
> 07/07/2003 15:32:58;0080; pbs_mom;Job;2015.lou.cascv.brown.edu;task 1
> terminated
Why did the task die immediately after starting? This is not
a good sign. Can you run "hello" using good-old mpich-p4 mpirun?
> 07/07/2003 15:32:58;0008;
> pbs_mom;Job;2015.lou.cascv.brown.edu;Terminated
> 07/07/2003 15:33:04;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
I'm guessing these mean that there is no socket to the job, since it
died. Not sure why this could not be handled more gracefully in PBS.
> 07/07/2003 15:33:04;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
> 07/07/2003 15:33:04;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
> 07/07/2003 15:33:04;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
> 07/07/2003 15:33:08;0080; pbs_mom;Job;2015.lou.cascv.brown.edu;task 2
> terminated
This is the second task on the node dying. I'm not sure what happened
to its "task started" message.
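To line up the start and terminate records for one job without the
tm_reply noise, something like the following awk filter may help. It is
only a sketch: it assumes each record in the mom log is a single line of
semicolon-separated fields with the job id in the fifth field (the
wrapping above comes from the mail client), and the function name
task_events is made up here.

```shell
# task_events JOBID [FILE...]
# Print the timestamp and message of every "task ..." record that
# belongs to JOBID. Reads stdin when no FILE is given.
task_events() {
    jobid=$1
    shift
    awk -F';' -v job="$jobid" '
        # field 1 = timestamp, field 5 = job id, field 6 = message
        $5 == job && $6 ~ /^task / { print $1 ": " $6 }
    ' "$@"
}
```

You would call it as "task_events 2015.lou.cascv.brown.edu FILE", with
FILE being the log for that day in the mom_logs directory. The
task_check lines are skipped because their fifth field is pbs_mom
rather than the job id.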
> 07/07/2003 15:33:10;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
> 07/07/2003 15:33:10;0001; pbs_mom;Svr;pbs_mom;task_check, cannot
> tm_reply to 2015.lou.cascv.brown.edu task 1
> 07/07/2003 15:33:10;0008;
> pbs_mom;Job;2015.lou.cascv.brown.edu;kill_job
> 07/07/2003 15:33:10;0080; pbs_mom;Job;2015.lou.cascv.brown.edu;Obit
> sent
> 07/07/2003 15:33:10;0100; pbs_mom;Req;;Type 54 request received from
> PBS_Server at frontend-1-16, sock=11
> 07/07/2003 15:33:10;0100; pbs_mom;Req;;Type 6 request received from
> PBS_Server at frontend-1-16, sock=11
Use an interactive job (qsub -I) to play around with your setup. Make
sure that mpirun does the right thing first, then try just mpiexec -np 1
or -np 2, working up to your big job. You may also want _not_ to turn
off stdout (drop -nostdout) and see if that gives you any more
information. Sorry for no clear solution yet.
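One way to organize that, as a rough sketch: the helper below runs a
command, echoes it, and reports its exit status, and run_ladder chains
the steps smallest first so it stops at the first failure. The names
try and run_ladder are made up, and the mpirun arguments (-np,
-machinefile) assume the stock MPICH-1 mpirun; adjust for your install.
Run it from inside "qsub -I -l nodes=4:ppn=2" so PBS_NODEFILE is set.

```shell
# try CMD [ARGS...]: run one step, show what was run and how it exited.
try() {
    echo "== $*"
    status=0
    "$@" || status=$?
    echo "== exit status $status"
    return "$status"
}

# Smallest case first; && stops the chain at the first failing step.
run_ladder() {
    try mpirun -np 2 -machinefile "$PBS_NODEFILE" hello &&
    try mpiexec --verbose --comm=p4 -np 1 hello &&
    try mpiexec --verbose --comm=p4 -np 2 hello &&
    try mpiexec --verbose --comm=p4 hello
}
```

Calling run_ladder at the interactive prompt then shows which step is
the first to fail, with stdout left on.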
-- Pete