MPIEXEC Problems

James O'Dell James_ODell at Brown.edu
Mon Jul 7 17:25:40 EDT 2003


Pete:

Thanks for the suggestions.

I changed test.sh to:

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=5:00
#PBS -l cput=40:00
#PBS -j oe
#PBS -o test.out
#mpiexec --nostdin -nostdout --verbose --comm=p4  hello
mpirun -np 8 -machinefile $PBS_NODEFILE /opt/mpiexec/bin/hello
 

and got the following output:

Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
Warning: Remote host denied X11 forwarding.
hello from 0/8 hostname compute-0-0 with 0 args:
hello from 3/8 hostname compute-0-1 with 0 args:
hello from 5/8 hostname compute-0-2 with 0 args:
hello from 1/8 hostname compute-0-0 with 0 args:
hello from 7/8 hostname compute-0-3 with 0 args:
hello from 2/8 hostname compute-0-1 with 0 args:
hello from 6/8 hostname compute-0-3 with 0 args:
hello from 4/8 hostname compute-0-2 with 0 args:
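
The placement in the output above (two ranks on each of four hosts) can be checked mechanically. This is only a sketch: the function name ranks_per_host is made up for illustration, and it assumes the "hello from R/N hostname HOST ..." line format shown above.

```shell
#!/bin/sh
# Count how many ranks landed on each host, reading "hello" output on stdin.
# Assumes the line format shown above:
#   hello from R/N hostname HOST with 0 args:
# Any other line (for example the X11 warnings) is skipped.
ranks_per_host() {
    awk '$1 == "hello" && $4 == "hostname" { count[$5]++ }
         END { for (h in count) print h, count[h] }' | sort
}

# Example using two lines copied from the output above.
printf '%s\n' \
    'hello from 0/8 hostname compute-0-0 with 0 args:' \
    'hello from 1/8 hostname compute-0-0 with 0 args:' | ranks_per_host
```

With nodes=4:ppn=2, running "ranks_per_host < test.out" should list each host with a count of 2.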
 
When I try it interactively, it works for 1 node with ppn=1.

When I try with ppn=2, I get the following:

[odell at lou odell]$ qsub -l nodes=1:ppn=2 -I
qsub: waiting for job 2024.lou.cascv.brown.edu to start
qsub: job 2024.lou.cascv.brown.edu ready
                                                                                
[odell at compute-0-0 odell]$ mpiexec hello
mpiexec: Error: wait_one_task_start: tm_poll remote: System error.
[odell at compute-0-0 odell]$ hello from 0/1 hostname compute-0-0 with 0
args:

==========================
The funny thing is I have an almost identical cluster where MPIEXEC
works flawlessly.


Any other ideas would be greatly appreciated.

Jim

On Mon, 2003-07-07 at 17:06, Pete Wyckoff wrote:
> James_ODell at brown.edu said on Mon, 07 Jul 2003 15:41 -0400:
> > Perhaps someone can help me.
> > 
> > I am using MPICH-1.2.5 compiled without shared memory.
> > I built MPIEXEC using:
> > ./configure  -with-pbs=/opt/OpenPBS/ --prefix=/opt/mpiexec
> > --disable-p4-shmem
> > 
> > Everything builds correctly.
> 
> Looks good.
> 
> > I have a small test file test.sh:
> > 
> > #!/bin/sh
> > #PBS -l nodes=4:ppn=2
> > #PBS -l walltime=5:00
> > #PBS -l cput=40:00
> > #PBS -j oe
> > #PBS -o test.out
> > sed -e "s/$/.gig.net/" $PBS_NODEFILE > nodes
> > PBS_NODEFILE=nodes
> > mpiexec --nostdin -nostdout --verbose --comm=p4  hello
> 
> Note that mpiexec ignores PBS_NODEFILE.  This isn't the cause of the
> problem below, but if you want to use a different interface, you might
> try:
>     mpiexec -transform-hostname='s/$/.gig.net/' ...
> and get what you're looking for above.
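
For reference, the sed expression used in both places simply appends the suffix to the end of each host name. A minimal sketch, run outside PBS; add_suffix is a made-up helper name, and the host names are taken from the log in this thread:

```shell
#!/bin/sh
# Append the ".gig.net" suffix to every host name read from stdin.
# "s/$/.gig.net/" matches the end of each line and inserts the suffix there.
add_suffix() {
    sed -e 's/$/.gig.net/'
}

# Example with two host names from the log in this thread.
printf '%s\n' compute-0-24 compute-0-25 | add_suffix
```

This only demonstrates the substitution itself; whether the resulting names resolve on the cluster is a separate question.
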
> 
> > When I run it test.out contains:
> > 
> > resolve_exe: found exe "hello" in path
> > node  0: name = compute-0-24, mpname = compute-0-24, cpu = 1
> > node  1: name = compute-0-24, mpname = compute-0-24, cpu = 0
> > node  2: name = compute-0-25, mpname = compute-0-25, cpu = 1
> > node  3: name = compute-0-25, mpname = compute-0-25, cpu = 0
> > node  4: name = compute-1-14, mpname = compute-1-14, cpu = 1
> > node  5: name = compute-1-14, mpname = compute-1-14, cpu = 0
> > node  6: name = compute-1-15, mpname = compute-1-15, cpu = 1
> > node  7: name = compute-1-15, mpname = compute-1-15, cpu = 0
> > mpiexec: Error: wait_one_task_start: tm_poll remote: System error.
> > wait_one_task_start: evt = 2, task 0 host compute-0-24
> > read_p4_master_port: waiting for port from master
> > read_p4_master_port: got port 50988
> 
> The way p4 startup works is that first just one job is spawned (the
> master), then mpiexec reads a port number from it which it feeds to
> later processes so they can find the master.
> 
> First it waits for TM, i.e. the pbs_mom, to say the job has started.
> That is fine since we see the wait_one_task_start happy message.  Then
> it goes to read the port from the master process over a socket.  That
> worked fine too.
> 
> Then mpiexec starts up the rest of the tasks, but while waiting to see
> if they ran okay, it was presented with the terrible "System error" message
> from the pbs_mom.  The error message is out of order in your output file
> since the pbs demuxer isn't too careful about preserving order between
> stdout and stderr.  You can run mpiexec inside an interactive job to see
> the messages in order if you want to make sure.
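
As a sketch of the shell side of this: when stderr is merged into stdout with 2>&1, both streams share one descriptor and lines appear in the order they were written. The emit function below is a hypothetical stand-in for the real command; this shows only the shell mechanism and says nothing about how the PBS demuxer or mpiexec forward task output.

```shell
#!/bin/sh
# emit is a stand-in for a command that writes to both streams in turn.
emit() {
    echo "out 1"
    echo "err 1" >&2
    echo "out 2"
}

# With 2>&1, stderr is duplicated onto stdout, so all three lines travel
# through the same descriptor and keep their original order.
merged() {
    emit 2>&1
}

merged
```

In an interactive job the same redirection could be applied to the real command and piped to a log file, assuming the shell there is a Bourne-style shell.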
> 
> > The mom_logs directory on the lead node contains the following:
> > 
> > 07/07/2003 15:32:58;0008;  
> > pbs_mom;Job;2015.lou.cascv.brown.edu;Started, pid = 8297
> > 07/07/2003 15:32:58;0100;   pbs_mom;Req;;Type 19 request received from
> > PBS_Server at frontend-1-16, sock=10
> > 07/07/2003 15:32:58;0008;   pbs_mom;Job;2015.lou.cascv.brown.edu;task
> > started, /bin/sh
> > 07/07/2003 15:32:58;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;task 1
> > terminated
> 
> Why did the task die immediately after starting?  This is not
> a good sign.  Can you run "hello" using good-old mpich-p4 mpirun?
> 
> > 07/07/2003 15:32:58;0008;  
> > pbs_mom;Job;2015.lou.cascv.brown.edu;Terminated
> > 07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> 
> I'm guessing these mean that there is no socket to the job, since it
> died.  Not sure why this could not be handled more gracefully in PBS.
> 
> > 07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> > 07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> > 07/07/2003 15:33:04;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> > 07/07/2003 15:33:08;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;task 2
> > terminated
> 
> This is the second task on the node dying.  Not sure what happened to
> its "starting" message.
> 
> > 07/07/2003 15:33:10;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> > 07/07/2003 15:33:10;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot
> > tm_reply to 2015.lou.cascv.brown.edu task 1
> > 07/07/2003 15:33:10;0008;  
> > pbs_mom;Job;2015.lou.cascv.brown.edu;kill_job
> > 07/07/2003 15:33:10;0080;   pbs_mom;Job;2015.lou.cascv.brown.edu;Obit
> > sent
> > 07/07/2003 15:33:10;0100;   pbs_mom;Req;;Type 54 request received from
> > PBS_Server at frontend-1-16, sock=11
> > 07/07/2003 15:33:10;0100;   pbs_mom;Req;;Type 6 request received from
> > PBS_Server at frontend-1-16, sock=11
> 
> Use an interactive job (qsub -I) to play around with your setup.  Make
> sure that mpirun does the right thing, then maybe just mpiexec -np 1
> or -np 2, working up to your big job.  You may want _not_ to turn off
> stdout and see if that gives you any more information.  Sorry for no
> clear solution yet.
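
The step-by-step approach suggested above could be scripted roughly as follows. This is a sketch under assumptions: the RUN switch and the step and ladder helpers are invented for illustration, the commands reuse only options that appear in this thread, and by default nothing is executed (each command is just printed).

```shell
#!/bin/sh
# Print each command; execute it only when RUN=1 (hypothetical switch),
# which is meant to be set inside a "qsub -I" session.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@" || { echo "FAILED: $*" >&2; return 1; }
    fi
}

# Work up from plain mpirun to a full mpiexec run, stopping at the
# first command that fails.
ladder() {
    step mpirun -np 2 -machinefile "${PBS_NODEFILE:-/dev/null}" hello &&
    step mpiexec -np 1 hello &&
    step mpiexec -np 2 hello &&
    step mpiexec --verbose hello
}

ladder
```

Running it with RUN=1 inside the interactive job shows which rung is the first to fail.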
> 
> 		-- Pete
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec
