tm_init: tm: not connected, protocol version 2?

Pete Wyckoff pw at osc.edu
Mon Jun 9 11:11:14 EDT 2003


lnthurston at ucdavis.edu said on Thu, 05 Jun 2003 16:57 -0700:
> Thanks very much for the hint.  I recompiled and now do not have that
> problem.  Unfortunately I seem to have just traded up for a different
> connection problem.  I've tried to resolve it myself but once again must
> admit defeat.  I would greatly appreciate any advice you have to offer.

> The output of the job looks like this...
> This jobs runs on the following processors:
> node5 node5 node4 node4 node3 node3 node2 node2
> Process 4485 attached
> pbs_iff: cannot connect to host
> Process 4485 detached
> mpiexec: Error: get_hosts: pbs_connect: Unauthorized Request .

This error message from pbs_iff happens when the TM library called by
mpiexec uses pbs_connect() to talk to the server to find out information
about the job.  The error message you quote above means either that
the host cannot be reached, or nothing is listening on the port number.
If it were an authentication failure you would have seen something else.

> The mom_log on node5 is not very informative...
> 06/05/2003 11:40:44;0100;   pbs_mom;Req;;Type 5 request received from
> PBS_Server at myriad.ucdavis.edu, sock=10
> 06/05/2003 11:40:44;0100;   pbs_mom;Req;;Type 19 request received from
> PBS_Server at myriad.ucdavis.edu, sock=10
> 06/05/2003 11:40:44;0008;   pbs_mom;Job;3731.myriad.ucdavis.edu;Started,
> pid = 4443
> 06/05/2003 11:40:44;0080;   pbs_mom;Job;3731.myriad.ucdavis.edu;task 1
> terminated
> 06/05/2003 11:40:44;0008;  
> pbs_mom;Job;3731.myriad.ucdavis.edu;Terminated
> 06/05/2003 11:40:44;0008;   pbs_mom;Job;3731.myriad.ucdavis.edu;kill_job
> 06/05/2003 11:40:44;0080;   pbs_mom;Job;3731.myriad.ucdavis.edu;Obit
> sent

The above bothers me since all the timestamps are the same.  If the
port number on the server were wrong, it would still take 10 seconds
for pbs_iff to give up.  If the server address were unreachable, it
would take a whopping 60 seconds to fail.
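
To see how long a connection attempt really takes to fail, something
like the following can be run by hand from the compute node.  This is
only a rough sketch in Python, not what pbs_iff does internally; the
probe() helper and its outcome strings are made up for illustration,
and the host and port to try are the ones from your own setup.

```python
import socket
import time


def probe(host, port, timeout=10.0):
    """Try one TCP connect; return (outcome, seconds_elapsed).

    outcome is "connected", "refused" (nothing listening on the port),
    "timeout" (no answer within `timeout` seconds), or "error: ..."
    for anything else, such as a name lookup failure.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"
    except socket.timeout:
        outcome = "timeout"
    except OSError as exc:
        outcome = "error: %s" % exc
    return outcome, time.monotonic() - start
```

Calling probe("myriad.ucdavis.edu", 15001) and printing the result
shows both the kind of failure and how long it took.  A failure that
comes back in well under a second points away from a slow network
timeout.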

> The end of the strace shows...
> 4485  11:40:44 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
> 4485  11:40:44 bind(4, {sin_family=AF_INET, sin_port=htons(1023),
> sin_addr=inet_addr("0.0.0.0")}}, 16) = -1 EACCES (Permission denied)

This is just an artifact of running as non-root.  pbs_iff tries to
get a privileged port before doing anything else.  This explains
why the timestamps are identical above: it exited before even starting
to try to connect() the socket as it could not get a local port.
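
The same bind() failure can be reproduced outside of PBS.  Here is a
minimal sketch in Python, assuming nothing beyond the standard library;
try_bind_reserved_port() is a made-up helper name, and port 1023 is
simply the one shown in the strace.

```python
import errno
import socket


def try_bind_reserved_port(port=1023):
    """Try to bind a TCP socket to a reserved (<1024) local port.

    Returns "ok" on success, otherwise the symbolic errno name,
    for example "EACCES" when the caller lacks the privilege.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("0.0.0.0", port))
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()
```

Run as an ordinary user on a typical Linux box, this is expected to
report EACCES, the same error the strace shows; run as root it should
normally succeed unless the port is already in use.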

> Running the command (on node5) 
> pbs_iff -t myriad.ucdavis.edu 15001
> is successful.

That says that the authentication path itself works just fine.

Can you look at the mom log output for a test without the strace to try
to figure out why pbs_iff is failing?  Given the working test above,
I can't guess why it would not work in practice.  Grasping at
straws, look around at the paths to pbs_iff and pbs_mom to make sure
everybody is running the correct versions of open vs. non-free PBS.
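
One way to do that path check is to list every copy of a program found
on a search path.  The sketch below is only an illustration: find_all()
is a made-up helper, and reporting the setuid bit rests on my
assumption that pbs_iff is normally installed setuid root so that it
can get its privileged port.

```python
import os
import stat


def find_all(name, path=None):
    """Return (full_path, is_setuid) for each executable `name` found.

    `path` is a search path string like $PATH; it defaults to the
    current PATH.  Directories are reported in search order, so the
    first hit is the copy a shell would run.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            mode = os.stat(candidate).st_mode
            hits.append((candidate, bool(mode & stat.S_ISUID)))
    return hits
```

For example, find_all("pbs_iff") shows whether more than one copy is
visible to the job.  pbs_mom usually lives outside a user's PATH, so
an explicit path argument with the relevant sbin directories would be
needed for it.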

		-- Pete


