mpiexec tm_poll error
Stefan Parnell
parnell at msi.umn.edu
Tue Jun 4 17:43:42 EDT 2002
I stumbled upon the answer (kinda funny). I thought to myself, well,
I wonder if PBSPro could operate with the pbs_demux of OpenPBS; perhaps
that is the problem, since the error always seems to be "Connection
refused (111) in open_demux". Well, why not give it a try: just back up
the Pro version and put in the Open version. Hmm, I say to myself, it's
a little odd that there doesn't seem to be a pbs_demux on my client
machines. Apparently pbs_demux is not in the pbs-mom rpm (in theory
the only one you would need on an execution host). I copied it over
to all the compute nodes and pow, works like a charm.
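
For anyone who wants to repeat the check, here is a rough sketch of what
I mean, not a tested recipe. The pbs_demux path, the use of ssh/scp, and
the node list file are all assumptions; substitute whatever your install
and site actually use.

```sh
#!/bin/sh
# Sketch: find nodes that lack pbs_demux and copy it over.
# ASSUMPTION: this path is hypothetical; it depends on how PBS was installed.
DEMUX="${DEMUX:-/usr/pbs/sbin/pbs_demux}"
# ASSUMPTION: nodes are reachable with ssh/scp; override RSH/RCP otherwise.
RSH="${RSH:-ssh}"
RCP="${RCP:-scp}"

# Read node names from stdin (one per line) and print those on which
# $DEMUX is missing or not executable.
nodes_missing_demux() {
    while read -r node; do
        [ -n "$node" ] || continue
        # </dev/null keeps the remote shell from swallowing the node list.
        if ! $RSH "$node" test -x "$DEMUX" </dev/null; then
            echo "$node"
        fi
    done
}

# Read node names from stdin and copy the local $DEMUX to the same
# path on each node, preserving mode and timestamps.
push_demux() {
    while read -r node; do
        [ -n "$node" ] || continue
        $RCP -p "$DEMUX" "$node:$DEMUX" </dev/null
    done
}
```

With a file nodes.txt (hypothetical) listing the compute nodes, something
like "nodes_missing_demux < nodes.txt | push_demux" would do the copy.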
So, anyway, I thought you might like to know that it appears mpiexec
will work fine with PBS Pro (though I suppose without some of the hacks).
I think the logs were strange enough that other folks might have the
same problem.
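
If it helps anyone spot the same thing, a small sketch for pulling the
affected job IDs out of a mom log follows. It only assumes the
semicolon-separated log layout shown in the excerpt quoted below; the
function name is made up, and the log file location is site-specific.

```sh
#!/bin/sh
# Sketch: list jobs whose task failed to start immediately after an
# open_demux "Connection refused" line in a pbs_mom log.
# Reads the named log files, or stdin if none are given.
failed_demux_jobs() {
    awk -F';' '
        # Remember that the previous line was an open_demux refusal.
        /Connection refused \(111\) in open_demux/ { refused = 1; next }
        # Field 4 is the record type, 5 the job ID, 6 the message.
        refused && $4 == "Job" && $6 ~ /^task not started/ { print $5 }
        # Any other line clears the flag.
        { refused = 0 }
    ' "$@" | sort -u
}
```

Run it as "failed_demux_jobs /path/to/mom_logs/somefile", with the path
replaced by wherever your pbs_mom writes its logs.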
Stefan
In reply to Pete Wyckoff (pw at osc.edu):
> parnell at msi.umn.edu said:
> > This may be a very simple problem. I'm using PBS Pro 5.1.4.
>
> You might be the first to test on pbspro.
>
> > When I run mpiexec I get this error every time:
> >
> > mpiexec: Error: wait_one_task_start: tm_poll remote: tm: system error.
> >
> > in the pbs_mom logs I see the following:
> >
> > 03/07/2002 17:29:07;0008; pbs_mom;Job;1838.nf00;Started, pid = 19046
> > 03/07/2002 17:29:14;0001; pbs_mom;Svr;pbs_mom;Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:33211
> > 03/07/2002 17:29:14;0001; pbs_mom;Job;1838.nf00;task not started, Failure /bin/tcsh -2
> > 03/07/2002 17:29:20;0001; pbs_mom;Svr;pbs_mom;Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:33211
> > 03/07/2002 17:29:20;0001; pbs_mom;Job;1838.nf00;task not started, Failure /bin/tcsh -2
> > 03/07/2002 17:29:20;0001; pbs_mom;Svr;pbs_mom;task_check, cannot tm_reply to 1838.nf00.msi.umn.edu task 1
> > 03/07/2002 17:29:20;0080; pbs_mom;Job;1838.nf00;task 1 terminated
> > 03/07/2002 17:29:20;0008; pbs_mom;Job;1838.nf00;Terminated
> >
> > In the mom config I list all the compute nodes and localhost as "clienthost"s
> > yet I always get this connection refused error.
>
> These connection refused errors likely indicate that the mpiexec stdio
> listener wasn't ready for the connection. Could be the patch did not go well
> against pbspro, or it could be that mpiexec died oddly before the
> compute process could connect back to it.
>
> However you can get rid of this by turning off stdio forwarding, as
> such:
>
> mpiexec -nostdin -nostdout mycode -args
>
> Let us know if this seems to work. If you do want to feed stdin
> to the first process of the job, you could do (perhaps, untested):
>
> mpiexec -nostdin -nostdout "mycode -args < file"
>
> If you want to help fix the pbspro patch, please do! I think the PBS
> guys won't complain if we distribute that as well as the openpbs patch,
> if they do turn out to be different.
>
> -- Pete
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec
--
Stefan Parnell <parnell at msi.umn.edu>
UNIX Systems Administrator,
University of Minnesota
Supercomputing Institute for Digital Simulation and Advanced Computation