mpiexec tm_poll error

Stefan Parnell parnell at msi.umn.edu
Tue Jun 4 17:43:42 EDT 2002


I stumbled upon the answer (kind of funny).  I wondered whether PBS Pro
could operate with the pbs_demux from OpenPBS, since the error was
always "Connection refused (111) in open_demux".  Why not give it a try:
just back up the Pro version and put in the Open version.  Then I
noticed that there was no pbs_demux at all on my client machines.
Apparently pbs_demux is not in the pbs-mom rpm (in theory the only one
you would need on an execution host).  I copied it over to all the
compute nodes and pow, works like a charm.
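
For anyone who wants to check a node for the same gap, here is a minimal
Python sketch that looks for an executable named pbs_demux in a list of
directories.  The function name find_pbs_demux and the idea of passing
directories on the command line are illustrative assumptions; the thread
does not say where PBS installs pbs_demux, so no install path is assumed.

```python
import os
import sys


def find_pbs_demux(search_dirs, name="pbs_demux"):
    """Return the first path in search_dirs holding an executable `name`.

    Returns None when no directory contains such a file.
    `find_pbs_demux` is a hypothetical helper, not part of PBS or mpiexec.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        # Require a regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Directories come from the command line; fall back to $PATH if none given.
    dirs = sys.argv[1:] or os.environ.get("PATH", "").split(os.pathsep)
    found = find_pbs_demux(dirs)
    if found:
        print("pbs_demux found at", found)
    else:
        print("pbs_demux not found in the directories searched")
```

This only inspects the machine it runs on, so it would have to be run on
each execution host in turn.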

So, anyway, I thought you might like to know that mpiexec appears to
work fine with PBS Pro (though I suppose without some of the hacks).
I think the logs were strange enough that other folks might run into
the same problem.
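
To spot the same symptom quickly, a rough Python sketch like the one
below can pull the open_demux failures out of a pbs_mom log.  It assumes
the semicolon-separated layout seen in the log excerpt quoted further
down; the function name demux_failures is made up for illustration, and
other PBS versions may format their logs differently.

```python
import re

# Matches the message text seen in the quoted pbs_mom log excerpt.
DEMUX_REFUSED = re.compile(
    r"Connection refused \(111\) in open_demux, "
    r"open_demux: connect (\S+):(\d+)"
)


def demux_failures(lines):
    """Return (timestamp, host, port) for each open_demux refusal.

    Assumes each log line has six ';'-separated fields, with the
    timestamp first and the message last, as in the quoted excerpt.
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a log line in the assumed layout
        match = DEMUX_REFUSED.search(fields[5])
        if match:
            hits.append((fields[0], match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__":
    sample = [
        "03/07/2002 17:29:07;0008;   pbs_mom;Job;1838.nf00;Started, pid = 19046",
        "03/07/2002 17:29:14;0001;   pbs_mom;Svr;pbs_mom;Connection refused (111)"
        " in open_demux, open_demux: connect 127.0.0.1:33211",
    ]
    for timestamp, host, port in demux_failures(sample):
        print(timestamp, host, port)
```

If every failing job shows these lines, a missing pbs_demux on the
execution host is one thing worth ruling out.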

Stefan

In reply to Pete Wyckoff (pw at osc.edu):

> parnell at msi.umn.edu said:
> > This may be a very simple problem.  I'm using PBS Pro 5.1.4.
> 
> You might be the first to test on pbspro.
> 
> > When I run mpiexec I get this error every time:
> > 
> > mpiexec: Error: wait_one_task_start: tm_poll remote: tm: system error.
> > 
> > in the pbs_mom logs I see the following:
> > 
> > 03/07/2002 17:29:07;0008;   pbs_mom;Job;1838.nf00;Started, pid = 19046
> > 03/07/2002 17:29:14;0001;   pbs_mom;Svr;pbs_mom;Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:33211
> > 03/07/2002 17:29:14;0001;   pbs_mom;Job;1838.nf00;task not started, Failure /bin/tcsh -2
> > 03/07/2002 17:29:20;0001;   pbs_mom;Svr;pbs_mom;Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:33211
> > 03/07/2002 17:29:20;0001;   pbs_mom;Job;1838.nf00;task not started, Failure /bin/tcsh -2
> > 03/07/2002 17:29:20;0001;   pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 1838.nf00.msi.umn.edu task 1
> > 03/07/2002 17:29:20;0080;   pbs_mom;Job;1838.nf00;task 1 terminated
> > 03/07/2002 17:29:20;0008;   pbs_mom;Job;1838.nf00;Terminated
> > 
> > In the mom config I list all the compute nodes and localhost as "clienthost"s
> > yet I always get this connection refused error.
> 
> These "connection refused" errors likely indicate that the mpiexec
> stdio listener wasn't ready for the connection.  It could be that the
> patch did not apply well against pbspro, or that mpiexec died oddly
> before the compute process could connect back to it.
> 
> However, you can get rid of this by turning off stdio forwarding,
> like so:
> 
>     mpiexec -nostdin -nostdout mycode -args
> 
> Let us know if this seems to work.  If you do want to feed stdin
> to the first process of the job, you could perhaps do (untested):
> 
>     mpiexec -nostdin -nostdout "mycode -args < file"
> 
> If you want to help fix the pbspro patch, please do!  I think the PBS
> guys won't complain if we distribute that as well as the openpbs patch,
> if they do turn out to be different.
> 
> 		-- Pete
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec

-- 
Stefan Parnell             <parnell at msi.umn.edu>
UNIX Systems Administrator,
University of Minnesota 
Supercomputing Institute for Digital Simulation and Advanced Computation
--


