mpirun with mpich-gm and pbs

Adam Gray graya at BATTELLE.ORG
Mon Dec 12 16:15:10 EST 2005


It works fine now.

What differences are there between the way mpich and mpich-gm report IPs to 
mpiexec? This whole problem seems to arise because mpich-gm uses the first 
hostname match and mpiexec uses something else (the last?). I understand 
that this may be beyond the scope of the original issue, but I'm curious in 
case something else comes up.
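
To make the question concrete, here is a minimal Python sketch of how a
first-match and a last-match lookup disagree on the hosts file quoted below.
The function name lookup and its pick parameter are hypothetical, made up for
illustration; this is not claimed to be what mpich-gm, mpich, or mpiexec
actually do internally.

```python
from typing import Optional

# Hosts file contents as quoted later in this thread.
HOSTS = """\
127.0.0.1 shrikenode01 localhost
192.168.100.1 shrikenode01
"""


def lookup(hosts_text: str, name: str, pick: str = "first") -> Optional[str]:
    """Return the address of the first or last hosts line that lists name."""
    matches = []
    for line in hosts_text.splitlines():
        # Drop comments, then split into address followed by names/aliases.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            matches.append(fields[0])
    if not matches:
        return None
    return matches[0] if pick == "first" else matches[-1]


if __name__ == "__main__":
    print("first match:", lookup(HOSTS, "shrikenode01", "first"))
    print("last match: ", lookup(HOSTS, "shrikenode01", "last"))
```

With only one line per hostname in the file, both strategies return the same
address, which is why fixing /etc/hosts makes the difference irrelevant.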

Thanks again for your help,

Adam Gray

On Monday 12 December 2005 3:39 pm, Tornes, Ivan E wrote:
> You are right, we have this in our /etc/hosts file:
>
> 127.0.0.1 shrikenode01 localhost
> 192.168.100.1 shrikenode01
>
> We did not think that this was wrong because on our cluster that uses
> standard Ethernet as the network, the /etc/hosts files are like the above
> and it works fine with mpiexec-0.76.  I will change this and hopefully that
> will fix everything.  Thanks.
>
> Ivan
>
>
>
> ________________________________
>
> From: Pete Wyckoff [mailto:pw at osc.edu]
> Sent: Mon 12/12/2005 3:23 PM
> To: Tornes, Ivan E
> Cc: mpiexec at osc.edu
> Subject: Re: mpirun with mpich-gm and pbs
>
> tornesi at BATTELLE.ORG wrote on Mon, 12 Dec 2005 13:59 -0500:
> > Here are the strace.out files after running the hello program.
> > mpiexec is looking for libmpich.so.1.0 which is in
> > /usr/local/mpich/intel/lib/shared/ on our system, but as you can
> > see from the error messages it is clearly looking for it somewhere
> > else.   Not sure what is going on here.  We have mpich-gm compiled
> > with both the intel and gnu compilers.
> >
> > The error message we get back from pbs is
> >
> > hello: shrikenode02  MPI_Init did not finish
> > hello: shrikenode01  MPI_Init did not finish
> >
> >
> > The respective strace.out files are strace_node01.out and
> > strace_node02.out
> >
> > This test was run using mpiexec-0.80 without the patch that you mentioned
> > for gm.c
>
> I'm insisting you have something awry in your /etc/hosts on the two
> nodes.  Your node01 does this:
>
>     bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>     getsockname(4, {sa_family=AF_INET, sin_port=htons(39818), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
>     listen(4, 1) = 0
>     [..]
>     connect(5, {sa_family=AF_INET, sin_port=htons(32852), sin_addr=inet_addr("192.168.100.2")}, 16) = -1 ECONNREFUSED (Connection refused)
>
> Your node02 does this:
>
>     bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>     getsockname(4, {sa_family=AF_INET, sin_port=htons(32861), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
>     listen(4, 1) = 0
>     [..]
>     connect(5, {sa_family=AF_INET, sin_port=htons(32852), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
>
> That first bind/listen socket (4) is the one that "hello" listens on for
> responses from mpiexec.  The second socket (5) is the one that
> initially connects to mpiexec.
>
> Both of them listen on 127.0.0.1.  This is bad:  they need to listen
> on the real address of the box, not the loopback interface.
>
> node01 tries to connect to 192.168.100.2, which sounds right, but
> nobody is listening there.  node02 tries to connect to loopback and
> connects to the mpiexec on that host.
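
A quick way to check for the symptom described above, without reading strace
output, is to see what address a node's own hostname resolves to. The
following is a minimal Python sketch under that assumption; the helper names
is_loopback and check_own_hostname are hypothetical, and the lookup goes
through the system resolver, which may or may not match the exact call the
MPI library makes.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the textual IP address is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


def check_own_hostname() -> None:
    """Resolve this machine's hostname and warn if it maps to loopback."""
    name = socket.gethostname()
    try:
        address = socket.gethostbyname(name)
    except OSError as err:
        print(f"{name}: could not resolve ({err})")
        return
    if is_loopback(address):
        print(f"{name} -> {address}: loopback, other nodes cannot connect")
    else:
        print(f"{name} -> {address}: looks like a real interface address")


if __name__ == "__main__":
    check_own_hostname()
```

Run on each node, a healthy setup would be expected to report the node's
cluster address (192.168.100.x in this thread) rather than 127.0.0.1.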
>
> Send me the host files from node01 and node02, and the output from
> "hostname ; ip a s" from each.  I suspect you have a line like this:
>
>     127.0.0.1  shrikenode01
>
> when you really should have:
>
>     127.0.0.1 localhost
>     192.168.100.1 shrikenode01
>
> Perhaps you can confirm?
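
The suspected misconfiguration can also be checked mechanically. Below is a
minimal Python sketch that scans hosts-file text and reports any name other
than localhost that sits on a loopback line. The function names and the
allowed tuple are assumptions made for illustration; adjust the allowed names
if your distribution ships extra loopback aliases such as ip6-localhost.

```python
import ipaddress
from typing import List, Tuple


def parse_hosts(text: str) -> List[Tuple[str, List[str]]]:
    """Return (address, names) pairs for each non-comment hosts line."""
    entries = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def names_on_loopback(
    text: str, allowed: Tuple[str, ...] = ("localhost", "localhost.localdomain")
) -> List[str]:
    """List hostnames mapped to a loopback address that are not in allowed."""
    flagged = []
    for address, names in parse_hosts(text):
        try:
            loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # Skip lines whose first field is not an IP address.
        if loopback:
            flagged.extend(n for n in names if n not in allowed)
    return flagged


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        for name in names_on_loopback(handle.read()):
            print(f"warning: {name} is mapped to a loopback address")
```

For the two-line layout recommended above, the check reports nothing; for a
file with "127.0.0.1 shrikenode01" in it, shrikenode01 is flagged.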
>
>                 -- Pete
>
>
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec

