mpiexec GMPI_SLAVE env.t problem
Bryan Hellyer
brh at unimelb.edu.au
Wed Aug 13 00:07:21 EDT 2003
Just getting back to this again after being tied up on other systems'
problems since last week.
So anyway, again, if I manually run "hello" via mpirun for 4*2 processes, as
can be (partially) seen from this extract, mpirun assigns node040 as the
master and nodes 001-004 as the slaves, and gets 192.168.xx.xx addresses
for GMPI_SLAVE.
We have "nodexxx" in /etc/hosts and DNS for the eth0 interfaces of the nodes
on a 172.xx.xx.xx private network, and "myrixxx" for the Myrinet interface
addresses on 192.168.xx.xx. I have "myrixxx" entries in the mpich-gm
"machines" file
/usr/local/mpich-gm/share/machines.ch_gm.LINUX
and mpirun seems to be handling this "translation" OK, but maybe mpiexec isn't.
Any suggestions?
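The naming convention described above (nodeNNN on eth0, myriNNN on Myrinet) can be sketched as a simple rename. This only illustrates the convention; it is not what mpirun or mpiexec do internally. The function names node_to_myri and myri_nodefile are hypothetical, and the sketch assumes every nodeNNN has a matching myriNNN entry.

```sh
#!/bin/sh
# Hypothetical helpers: illustrate the nodeNNN -> myriNNN naming convention.

# Map one eth0 host name to its Myrinet name (node040 -> myri040).
# Names that do not start with "node" are passed through unchanged.
node_to_myri() {
    printf '%s\n' "$1" | sed 's/^node/myri/'
}

# Rewrite a whole node list (one host name per line), e.g. $PBS_NODEFILE.
myri_nodefile() {
    sed 's/^node/myri/' "$1"
}
```

Something like `myri_nodefile "$PBS_NODEFILE"` inside the job script would then list the Myrinet names for the allocated nodes, which can be compared against the entries in the machines file.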
[Lots of diag output follows. Sorry folks :=}]
====== hello with mpirun (with gmpi_getenv diags via printf's)
119 alfred brh> mpirun -np 8 -nodes 4 ./hello
120 alfred brh> ssh node040
Last login: Mon Aug 4 15:30:36 2003 from genghis
1 node040 brh> cd /data2/users/brh/mpiexec_tst
2 node040 brh> mpirun -np 8 -nodes 4 ./hello
BH: gmpi_conf.c : gethostbyname returned node003
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.3
BH: gmpi_getenv var : GMPI_ID , result 4
BH: gmpi_getenv var : GMPI_NP , result 8
BH: gmpi_conf.c : gethostbyname returned node003
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.3
BH: gmpi_getenv var : GMPI_ID , result 5
BH: gmpi_getenv var : GMPI_NP , result 8
BH: gmpi_getenv var : GMPI_BOARD , result -1
BH: gmpi_getenv var : GMPI_NUMA_NODE , result (null)
BH: gmpi_getenv var : GMPI_EAGER , result (null)
BH: gmpi_getenv var : GMPI_SHMEM , result 1
BH: gmpi_getenv var : GMPI_RECV , result (null)
BH: gmpi_getenv var : GMPI_BOARD , result -1
BH: gmpi_getenv var : GMPI_NUMA_NODE , result (null)
BH: gmpi_getenv var : GMPI_EAGER , result (null)
BH: gmpi_getenv var : GMPI_SHMEM , result 1
BH: gmpi_getenv var : GMPI_RECV , result (null)
BH: gmpi_conf.c : gethostbyname returned node002
BH: gmpi_conf.c : gethostbyname returned node002
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 8000
BH: gmpi_getenv var : GMPI_MAGIC , result 4438954
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_SLAVE , result 192.168.1.2
etc.
====== host lookups
125 alfred brh> host alfred
alfred.localnet has address 172.20.4.1
126 alfred brh> host node040
node040.localnet has address 172.20.3.40
127 alfred brh> host node001
node001.localnet has address 172.20.3.1
128 alfred brh> host myri040
myri040.localnet has address 192.168.1.40
129 alfred brh> host myri001
myri001.localnet has address 192.168.1.1
130 alfred brh>
====== mpiexec -v -v -> pbs stuff
132 alfred brh> cat testqs
#!/bin/sh
#PBS -q mpi_gm
#PBS -l nodes=4:ppn=2
#PBS -l walltime=5:00
#PBS -l cput=40:00
#PBS -j oe
#PBS -o testqs_oe.out
echo PBS_QUEUE=$PBS_QUEUE PBS_NODEFILE=$PBS_NODEFILE
cat $PBS_NODEFILE
# env
cd /data2/users/brh/mpiexec_tst
echo "mpiexec -v -v"
echo '[1+pd5>x]sx0lxxq' | dc | strace -f -o /tmp/mpiexec_1.log ./mpiexec -v -v \
-allstdin ./hello > testqs_hello.out 2>&1
qstat -an
133 alfred brh>
133 alfred brh> cat testqs_hello.out
stat_exe: testing "./hello"
resolve_exe: using absolute exe "./hello"
node 0: name = node040, mpname = node040, cpu = 1
node 1: name = node040, mpname = node040, cpu = 0
node 2: name = node039, mpname = node039, cpu = 1
node 3: name = node039, mpname = node039, cpu = 0
node 4: name = node038, mpname = node038, cpu = 1
node 5: name = node038, mpname = node038, cpu = 0
node 6: name = node037, mpname = node037, cpu = 1
node 7: name = node037, mpname = node037, cpu = 0
stdio_fork: might listen on one of abort_fd_array (3,4)
stdio_fork: built listener 0 in fd 5 on port 36731
stdio_fork: built listener 1 in fd 6 on port 36732
stdio_fork: built listener 2 in fd 7 on port 36733
Process 29769 attached
Process 29769 detached
goodbye_from_parent: got signal 15, exiting now
command to 0/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 1/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 2/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 3/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 4/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 5/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 6/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
command to 7/8
argv 0 /bin/sh
argv 1 -c
argv 2 if test -d "/data2/users/brh/mpiexec_tst"; then cd "/data2/users/brh/mpiexec_tst"; fi; exec /bin/tcsh -c 'exec ./hello'
wait_one_task_start: evt = 2, task 0 host node040
wait_one_task_start: evt = 6, task 4 host node038
wait_one_task_start: evt = 8, task 6 host node037
wait_one_task_start: evt = 3, task 1 host node040
wait_one_task_start: evt = 4, task 2 host node039
wait_one_task_start: evt = 5, task 3 host node039
wait_one_task_start: evt = 7, task 5 host node038
wait_one_task_start: evt = 9, task 7 host node037
All 8 tasks started.
read_gm_startup_ports: waiting for info
read_gm_startup_ports: obit check returns 0
read_gm_startup_ports: obit check returns 10
wait_tasks: numspawned = 8, got evt 10 for tid 2 host node040 status 255
wait_tasks: numspawned = 7, got evt 11 for tid 3 host node040 status 255
wait_tasks: numspawned = 6, got evt 12 for tid 4 host node039 status 255
wait_tasks: numspawned = 5, got evt 14 for tid 6 host node038 status 255
wait_tasks: numspawned = 4, got evt 15 for tid 7 host node038 status 255
wait_tasks: numspawned = 3, got evt 13 for tid 5 host node039 status 255
wait_tasks: numspawned = 2, got evt 16 for tid 8 host node037 status 255
wait_tasks: numspawned = 1, got evt 17 for tid 9 host node037 status 255
kill_stdio: sent SIGTERM, waiting on 29769
mpiexec: Warning: main: task 0 exited with status 255.
mpiexec: Warning: main: task 1 exited with status 255.
mpiexec: Warning: main: task 2 exited with status 255.
mpiexec: Warning: main: task 3 exited with status 255.
mpiexec: Warning: main: task 4 exited with status 255.
mpiexec: Warning: main: task 5 exited with status 255.
mpiexec: Warning: main: task 6 exited with status 255.
mpiexec: Warning: main: task 7 exited with status 255.
134 alfred brh>
136 alfred brh> cat testqs_oe.out
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
PBS_QUEUE=mpi_gm PBS_NODEFILE=/usr/spool/PBS/aux/213.genghis
node040
node040
node039
node039
node038
node038
node037
node037
mpiexec -v -v
BH: gmpi_conf.c : gethostbyname returned node040
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node040
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node037
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node039
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMBH: gmpi_conf.c
: gethostbyname returned node
038
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GMPI_SLAVE !
[0] Error: write to socket failed !
> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node038
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node037
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
BH: gmpi_conf.c : gethostbyname returned node039
BH: gmpi_getenv var : GMPI_MAGIC , result 213
BH: gmpi_getenv var : GMPI_MASTER , result node040
BH: gmpi_getenv var : GMPI_PORT , result 36729
BH: gmpi_getenv var : GMPI_SLAVE , result (null)
<MPICH-GM> Error: Need to obtain the slave's hostname in GMPI_SLAVE !
[0] Error: write to socket failed !
genghis:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
213.genghis     brh      mpi_gm   testqs      29739   4  --     -- 00:40 R    --
   node040/1+node040/0+node039/1+node039/0+node038/1+node038/0+node037/1+node037/0
137 alfred brh>
=======
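The difference between the two runs (GMPI_SLAVE set under mpirun, "(null)" under mpiexec) could also be checked without patching gmpi_conf.c, by launching a small script in place of ./hello. The following is a minimal sketch under that assumption; the function name report_gmpi_env is hypothetical, and the variable list is just the set of names that appear in the diagnostics above, not necessarily everything MPICH-GM reads.

```sh
#!/bin/sh
# Hypothetical diagnostic: report which MPICH-GM startup variables the
# launcher exported to this process.
report_gmpi_env() {
    for name in GMPI_MAGIC GMPI_MASTER GMPI_PORT GMPI_SLAVE GMPI_ID GMPI_NP; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is unset or empty"
        fi
    done
    return 0
}

report_gmpi_env
```

Run under both launchers, the output should show directly which variables each one passes to the tasks on each node.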
Thanx again
Bryan
---------------------------------------
Bryan Hellyer
HPC Systems Programmer
ITS Systems & Infrastructure
University of Melbourne