tm_init: tm: not connected, protocol version 2?

Lisa Thurston lnthurston at ucdavis.edu
Tue Jun 10 16:55:33 EDT 2003


Pete

When I ran the same process without the strace it worked just fine. 
That's beyond my comprehension, but I'm pleased.

Thanks for all your help.
Lisa


On Mon, 2003-06-09 at 08:11, Pete Wyckoff wrote:
> lnthurston at ucdavis.edu said on Thu, 05 Jun 2003 16:57 -0700:
> > Thanks very much for the hint.  I recompiled and now do not have that
> > problem.  Unfortunately I seem to have just traded up for a different
> > connection problem.  I've tried to resolve it myself but once again must
> > admit defeat.  I would greatly appreciate any advice you have to offer.
> 
> > The output of the job looks like this...
> > This jobs runs on the following processors:
> > node5 node5 node4 node4 node3 node3 node2 node2
> > Process 4485 attached
> > pbs_iff: cannot connect to host
> > Process 4485 detached
> > mpiexec: Error: get_hosts: pbs_connect: Unauthorized Request .
> 
> This error message from pbs_iff happens when the TM library called by
> mpiexec uses pbs_connect() to talk to the server to find out information
> about the job.  The error message you quote above means either that
> the host cannot be reached, or nothing is listening on the port number.
> If it were an authentication failure you would have seen something else.
> 
> > The mom_log on node5 is not very informative...
> > 06/05/2003 11:40:44;0100;   pbs_mom;Req;;Type 5 request received from
> > PBS_Server at myriad.ucdavis.edu, sock=10
> > 06/05/2003 11:40:44;0100;   pbs_mom;Req;;Type 19 request received from
> > PBS_Server at myriad.ucdavis.edu, sock=10
> > 06/05/2003 11:40:44;0008;   pbs_mom;Job;3731.myriad.ucdavis.edu;Started,
> > pid = 4443
> > 06/05/2003 11:40:44;0080;   pbs_mom;Job;3731.myriad.ucdavis.edu;task 1
> > terminated
> > 06/05/2003 11:40:44;0008;   pbs_mom;Job;3731.myriad.ucdavis.edu;Terminated
> > 06/05/2003 11:40:44;0008;   pbs_mom;Job;3731.myriad.ucdavis.edu;kill_job
> > 06/05/2003 11:40:44;0080;   pbs_mom;Job;3731.myriad.ucdavis.edu;Obit
> > sent
> 
> The above bothers me since all the timestamps are the same.  If the
> port number on the server were wrong, it still takes 10 seconds for
> pbs_iff to give up.  If the server address were unreachable, it would
> take a whopping 60 seconds to fail.
> 
> > The end of the strace shows...
> > 4485  11:40:44 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
> > 4485  11:40:44 bind(4, {sin_family=AF_INET, sin_port=htons(1023),
> > sin_addr=inet_addr("0.0.0.0")}}, 16) = -1 EACCES (Permission denied)
> 
> This is just an artifact of running as non-root.  pbs_iff tries to
> get a privileged port before doing anything else.  This explains
> why the timestamps are identical above: it exited before even starting
> to try to connect() the socket as it could not get a local port.
> 
> > Running the command (on node5) 
> > pbs_iff -t myriad.ucdavis.edu 15001
> > is successful.
> 
> That says that everything works just fine related to the authentication.
> 
> Can you look at the mom log output for a test without the strace to try
> to figure out why pbs_iff is failing?  Due to the working test above,
> I can't seem to guess why it would not work in practice.  Grasping at
> straws, look around at paths to pbs_iff and pbs_mom to make sure
> everybody is running the correct versions of open vs non-free pbs.
> 
> 		-- Pete
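
The bind() failure in the strace above can be reproduced outside of PBS. The following is a minimal sketch, not anything taken from pbs_iff itself: the helper name try_bind is made up for illustration, and it only assumes a Linux-like system where ports below 1024 are normally reserved for root. Run as an ordinary user, the attempt on port 1023 would typically report EACCES, as in the strace; run as root (or on a system configured to allow unprivileged low ports) it may succeed. One plausible reason the job works without strace is that pbs_iff is usually installed setuid root, and a traced process does not get that privilege, but that is an assumption about this installation.

```python
import errno
import socket


def try_bind(port, host="0.0.0.0"):
    """Try to bind a TCP socket to (host, port).

    Returns "ok" on success, otherwise the symbolic errno name
    (for example "EACCES" or "EADDRINUSE").
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return "ok"
    except OSError as exc:
        # Map the numeric errno to its name; fall back to the number itself.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    # 1023 is the port seen in the strace; 0 asks the kernel for any free port.
    for port in (1023, 0):
        print(port, try_bind(port))
```

Comparing the result for port 1023 when run as a normal user and as root shows whether the local privileged-port step is what fails, before any connect() to the server is attempted.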
