mpiexec GMPI_SLAVE env.t problem

Bryan Hellyer brh at unimelb.edu.au
Wed Aug 13 00:07:21 EDT 2003


Just getting back to this again after being tied up on other systems' 
problems since last week.

So anyway, again: if I manually run "hello" via mpirun for 4*2 processes, then,
as can be (partially) seen from the extract below, mpirun assigns node040 as the
master and nodes 001-004 as the slaves, and gets 192.168.xx.xx addresses for
GMPI_SLAVE.
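
As a rough sketch (not part of mpich-gm or mpiexec), the values reported for one
GMPI_ variable can be pulled out of the "BH: gmpi_getenv var" diag lines with a
small shell helper. The function name gmpi_values and the log file argument are
hypothetical; the line layout assumed is the one produced by the printf diags
shown further down.

```shell
#!/bin/sh
# Hypothetical helper: print the distinct values reported for one GMPI_
# variable in a diag log made of lines such as
#   BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.3
# $1 = variable name, $2 = log file
gmpi_values() {
    awk -v var="$1" '$2 == "gmpi_getenv" && $5 == var { print $NF }' "$2" | sort -u
}
```

For example, `gmpi_values GMPI_SLAVE testqs_oe.out` would list whatever values
(including "(null)") that log reports for GMPI_SLAVE.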

We have "nodexxx" in /etc/hosts and DNS for the eth0 interfaces of the nodes, on
a 172.xx.xx.xx private network, and "myrixxx" for the Myrinet interface
addresses on 192.168.xx.xx.

I have "myrixxx" entries in the mpich-gm "machines" file
  /usr/local/mpich-gm/share/machines.ch_gm.LINUX
and mpirun seems to be handling this "translation" OK, but maybe mpiexec isn't.
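
To make the naming convention concrete, here is a minimal sketch of the
nodexxx -> myrixxx mapping, assuming the two names always differ only in the
prefix (as with node040/myri040 in the host lookups below). The function names
to_myri and translate_nodefile are hypothetical, and this only illustrates the
mapping; it is not how mpirun or mpiexec actually does it.

```shell
#!/bin/sh
# Hypothetical helper: map an eth0 host name (nodeNNN) to the matching
# Myrinet host name (myriNNN) by swapping the prefix.
to_myri() {
    echo "$1" | sed 's/^node/myri/'
}

# Translate every entry of a node file (one host name per line),
# e.g. the file named by $PBS_NODEFILE inside a PBS job.
translate_nodefile() {
    while read -r host; do
        to_myri "$host"
    done < "$1"
}
```

Inside the job script, `translate_nodefile "$PBS_NODEFILE"` would print one
myrixxx name per allocated CPU slot. Whether mpiexec can make use of such a
list is exactly the open question.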

Any suggestions?

[Lots of diag output follows. Sorry folks  :=}]

====== hello with mpirun  (with gmpi_getenv diags via printf's)
119 alfred brh> mpirun -np 8 -nodes 4 ./hello
120 alfred brh> ssh node040
Last login: Mon Aug  4 15:30:36 2003 from genghis
1 node040 brh> cd /data2/users/brh/mpiexec_tst
2 node040 brh> mpirun -np 8 -nodes 4 ./hello
BH: gmpi_conf.c : gethostbyname returned node003
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.3
BH: gmpi_getenv var : GMPI_ID , result 4
BH: gmpi_getenv var : GMPI_NP , result 8
BH: gmpi_conf.c : gethostbyname returned node003
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.3
BH: gmpi_getenv var : GMPI_ID , result 5
BH: gmpi_getenv var : GMPI_NP , result 8
BH: gmpi_getenv var : GMPI_BOARD , result -1
BH: gmpi_getenv var : GMPI_NUMA_NODE , result (null)
BH: gmpi_getenv var : GMPI_EAGER , result (null)
BH: gmpi_getenv var : GMPI_SHMEM , result 1
BH: gmpi_getenv var : GMPI_RECV , result (null)
BH: gmpi_getenv var : GMPI_BOARD , result -1
BH: gmpi_getenv var : GMPI_NUMA_NODE , result (null)
BH: gmpi_getenv var : GMPI_EAGER , result (null)
BH: gmpi_getenv var : GMPI_SHMEM , result 1
BH: gmpi_getenv var : GMPI_RECV , result (null)
BH: gmpi_conf.c : gethostbyname returned node002
BH: gmpi_conf.c : gethostbyname returned node002
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.2
etc.

====== host lookups
125 alfred brh> host alfred
alfred.localnet has address 172.20.4.1
126 alfred brh> host node040
node040.localnet has address 172.20.3.40
127 alfred brh> host node001
node001.localnet has address 172.20.3.1
128 alfred brh> host myri040
myri040.localnet has address 192.168.1.40
129 alfred brh> host myri001
myri001.localnet has address 192.168.1.1
130 alfred brh>


====== mpiexec -v -v -> pbs stuff
132 alfred brh> cat testqs
#!/bin/sh
#PBS -q mpi_gm
#PBS -l nodes=4:ppn=2
#PBS -l walltime=5:00
#PBS -l cput=40:00
#PBS -j oe
#PBS -o testqs_oe.out
echo PBS_QUEUE=$PBS_QUEUE  PBS_NODEFILE=$PBS_NODEFILE
cat $PBS_NODEFILE
# env
cd /data2/users/brh/mpiexec_tst
echo "mpiexec -v -v"
echo '[1+pd5>x]sx0lxxq' | dc | strace -f -o /tmp/mpiexec_1.log ./mpiexec -v -v  \
   -allstdin ./hello  > testqs_hello.out 2>&1

qstat -an
133 alfred brh>
133 alfred brh> cat testqs_hello.out
stat_exe: testing "./hello"
resolve_exe: using absolute exe "./hello"
node  0: name = node040, mpname = node040, cpu = 1
node  1: name = node040, mpname = node040, cpu = 0
node  2: name = node039, mpname = node039, cpu = 1
node  3: name = node039, mpname = node039, cpu = 0
node  4: name = node038, mpname = node038, cpu = 1
node  5: name = node038, mpname = node038, cpu = 0
node  6: name = node037, mpname = node037, cpu = 1
node  7: name = node037, mpname = node037, cpu = 0
stdio_fork: might listen on one of abort_fd_array (3,4)
stdio_fork: built listener 0 in fd 5 on port 36731
stdio_fork: built listener 1 in fd 6 on port 36732
stdio_fork: built listener 2 in fd 7 on port 36733
Process 29769 attached
Process 29769 detached
goodbye_from_parent: got signal 15, exiting now
command to 0/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 1/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 2/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 3/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 4/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 5/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 6/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 7/8
argv  0 /bin/sh
argv  1 -c
argv  2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
wait_one_task_start: evt = 2, task 0 host node040
wait_one_task_start: evt = 6, task 4 host node038
wait_one_task_start: evt = 8, task 6 host node037
wait_one_task_start: evt = 3, task 1 host node040
wait_one_task_start: evt = 4, task 2 host node039
wait_one_task_start: evt = 5, task 3 host node039
wait_one_task_start: evt = 7, task 5 host node038
wait_one_task_start: evt = 9, task 7 host node037
All 8 tasks started.
read_gm_startup_ports: waiting for info
read_gm_startup_ports: obit check returns 0
read_gm_startup_ports: obit check returns 10
wait_tasks: numspawned = 8, got evt 10 for tid 2 host node040 status 255
wait_tasks: numspawned = 7, got evt 11 for tid 3 host node040 status 255
wait_tasks: numspawned = 6, got evt 12 for tid 4 host node039 status 255
wait_tasks: numspawned = 5, got evt 14 for tid 6 host node038 status 255
wait_tasks: numspawned = 4, got evt 15 for tid 7 host node038 status 255
wait_tasks: numspawned = 3, got evt 13 for tid 5 host node039 status 255
wait_tasks: numspawned = 2, got evt 16 for tid 8 host node037 status 255
wait_tasks: numspawned = 1, got evt 17 for tid 9 host node037 status 255
kill_stdio: sent SIGTERM, waiting on 29769
mpiexec: Warning: main: task 0 exited with status 255.
mpiexec: Warning: main: task 1 exited with status 255.
mpiexec: Warning: main: task 2 exited with status 255.
mpiexec: Warning: main: task 3 exited with status 255.
mpiexec: Warning: main: task 4 exited with status 255.
mpiexec: Warning: main: task 5 exited with status 255.
mpiexec: Warning: main: task 6 exited with status 255.
mpiexec: Warning: main: task 7 exited with status 255.
134 alfred brh>

136 alfred brh> cat  testqs_oe.out
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
PBS_QUEUE=mpi_gm PBS_NODEFILE=/usr/spool/PBS/aux/213.genghis
node040
node040
node039
node039
node038
node038
node037
node037
mpiexec -v -v
BH: gmpi_conf.c : gethostbyname returned node040
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node040
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node037
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node039
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMBH: gmpi_conf.c 
: gethostbyname returned node
038
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GMPI_SLAVE !
[0] Error: write to socket failed !
 > Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node038
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node037
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node039
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !

genghis:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
213.genghis     brh      mpi_gm   testqs      29739   4  --    --  00:40 R   --
    node040/1+node040/0+node039/1+node039/0+node038/1+node038/0+node037/1
    +node037/0
137 alfred brh>


=======

Thanx again
Bryan
---------------------------------------
Bryan Hellyer
HPC Systems Programmer
ITS Systems & Infrastructure
University of Melbourne



