mpiexec svn and mvapich 0.97 failure to start jobs
Jimmy Tang
jtang at tchpc.tcd.ie
Wed Mar 15 13:08:13 EST 2006
Hi,
With the announcement of mvapich 0.97 in the last day or so, I decided
to install the latest versions of mvapich and mpiexec on our InfiniBand
cluster. The setup is:
mpiexec svn checkout - 20060315
mvapich 0.97
Our IB stack is the Voltaire-shipped ibhost-3.x
(the current release, 3.5.0 I think)
I compiled cpi.c against mvapich 0.97 and ran it with the latest mpiexec
version. The job fails to start, and I get this:
------------------------------------------------------------------------------
17:53:15 jtang at iitac128 ~/sandbox/mpiexec-tests $
/usr/support/mpiexec-svn-20060315/bin/mpiexec -verbose ./cpi-97
mpiexec: resolve_exe: using absolute exe "./cpi-97".
connect: Connection refused
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
mpiexec: kill_tasks: killing all tasks.
mpiexec: Warning: task 0 exited with status 1.
------------------------------------------------------------------------------
This same version of mpiexec works with the older mvapich 0.95:
------------------------------------------------------------------------------
17:56:49 jtang at iitac128 ~/sandbox/mpiexec-tests $
/usr/support/mpiexec-svn-20060315/bin/mpiexec -verbose ./cpi-95
mpiexec: resolve_exe: using absolute exe "./cpi-95".
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: read_ib_one: version 3 startup.
mpiexec: service_ib_startup: rank 0 in, 0 + 0 left.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 0 to accept, 0 to read.
mpiexec: read_ib_startup_ports: barrier start.
mpiexec: read_ib_startup_ports: barrier done.
mpiexec: wait_tasks: waiting for iitac128.ib.tchpc.tcd.ie.
Process 0 on iitac128.tchpc.tcd.ie
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000082
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 0.
------------------------------------------------------------------------------
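To make the difference between the two runs easier to see, here is a rough sketch that pulls the mpiexec function names out of the two verbose logs above and lists the stages that show up with 0.95 but never with 0.97. The log text is copied from the output above with the wrapped lines rejoined; the helper name `stages` and the regular expression are my own and not part of mpiexec.

```python
import re

# Verbose output copied from the two runs above (wrapped lines rejoined).
LOG_097 = """\
mpiexec: resolve_exe: using absolute exe "./cpi-97".
connect: Connection refused
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
mpiexec: kill_tasks: killing all tasks.
mpiexec: Warning: task 0 exited with status 1.
"""

LOG_095 = """\
mpiexec: resolve_exe: using absolute exe "./cpi-95".
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: read_ib_one: version 3 startup.
mpiexec: service_ib_startup: rank 0 in, 0 + 0 left.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 0 to accept, 0 to read.
mpiexec: read_ib_startup_ports: barrier start.
mpiexec: read_ib_startup_ports: barrier done.
mpiexec: wait_tasks: waiting for iitac128.ib.tchpc.tcd.ie.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 0.
"""


def stages(log):
    """Return the lower-case 'mpiexec: <name>:' tags in order of first appearance."""
    seen = []
    for line in log.splitlines():
        m = re.match(r"mpiexec: ([a-z_]+):", line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen


# Stages reached with 0.95 that the 0.97 run never reports.
missing = [s for s in stages(LOG_095) if s not in stages(LOG_097)]
print(missing)
```

With 0.97 the read_ib_one and service_ib_startup messages never appear, so as far as I can tell the task never checks in with mpiexec before it exits.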
Job startup was broken with mvapich 0.96 as well. Anyway, I did an strace
of the runs with the two different versions of cpi (each built against the
relevant mvapich). I've only attached what I thought might be useful to the
developers/bug fixers.
Here's the tail end of the output from the failed job startup with 0.97:
------------------------------------------------------------------------------
setsockopt(0, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
connect(0, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(0, "+2+12+2928345.login01.ib.tchpc.t"..., 88) = 88
select(1024, [0], NULL, NULL, {0, 0}) = 1 (in [0], left {0, 0})
fcntl(0, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE)
read(0, "+2+1+0+3+1", 65536) = 10
close(0) = 0
write(2, "mpiexec: ", 9) = 9
write(2, "process_obit_event: evt 3 task 0"..., 67) = 67
write(2, ".\n", 2) = 2
close(4) = 0
select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
nanosleep({0, 200000000}, NULL) = 0
write(2, "mpiexec: ", 9) = 9
write(2, "kill_tasks: killing all tasks", 29) = 29
write(2, ".\n", 2) = 2
kill(28037, SIGTERM) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(28037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 28037
write(2, "mpiexec: Warning: ", 18) = 18
write(2, "task 0 exited with status 1", 27) = 27
write(2, ".\n", 2) = 2
rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, NULL, 8) = 0
close(3) = 0
unlink("/tmp/mpiexec-sock/jtang/28345.iitac128.tchpc.tcd.ie") = 0
rmdir("/tmp/mpiexec-sock/jtang") = 0
rmdir("/tmp/mpiexec-sock") = 0
exit_group(0x1, 0x1, 0x2a958a7530, 0x2a958a8e08, 0x2a958ab090
<unfinished ... exit status 1>
------------------------------------------------------------------------------
And here's the output from a successful job startup with 0.95:
------------------------------------------------------------------------------
setsockopt(0, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
connect(0, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(0, "+2+12+2928345.login01.ib.tchpc.t"..., 88) = 88
select(1024, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
accept(4, 0, NULL) = 5
fcntl(5, F_GETFL) = 0x2 (flags O_RDWR|O_LARGEFILE)
fcntl(5, F_SETFL, O_RDWR) = 0
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, 0) = 1
read(5, "\3\0\0\0", 4) = 4
read(5, "\0\0\0\0", 4) = 4
read(5, "\10\0\0\0", 4) = 4
write(2, "mpiexec: ", 9) = 9
write(2, "read_ib_one: version 3 startup", 30) = 30
write(2, ".\n", 2) = 2
read(5, "\313\0\0\0\206\342r\372", 8) = 8
read(5, "\4\0\0\0", 4) = 4
read(5, "\177m\0\0", 4) = 4
write(2, "mpiexec: ", 9) = 9
write(2, "service_ib_startup: rank 0 in, 0"..., 41) = 41
write(2, ".\n", 2) = 2
select(1024, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
accept(4, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll(0x54c500, 0, 0) = 0
select(1024, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0}) = 0 (Timeout)
nanosleep({0, 200000000}, NULL) = 0
write(2, "mpiexec: ", 9) = 9
write(2, "All 1 task (spawn 0) started", 28) = 28
write(2, ".\n", 2) = 2
write(2, "mpiexec: ", 9) = 9
write(2, "read_ib_startup_ports: waiting f"..., 66) = 66
write(2, ".\n", 2) = 2
fcntl(4, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE)
fcntl(4, F_SETFL, O_RDWR) = 0
close(4) = 0
write(8, "\5\0\0\0", 4) = 4
------------------------------------------------------------------------------
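For anyone reading the 0.95 trace, here is a small sketch that decodes the payloads mpiexec reads from the accepted socket (fd 5). It assumes the 4-byte reads are little-endian 32-bit integers, which is a guess based on the machine being x86-64; the byte strings are copied verbatim from the strace output, and the helper name `as_le32` is mine.

```python
import struct

# Payloads copied verbatim from the read(5, ...) calls in the strace above
# (strace prints non-printable bytes as octal escapes).
reads = [
    b"\3\0\0\0",
    b"\0\0\0\0",
    b"\10\0\0\0",
    b"\313\0\0\0\206\342r\372",
    b"\4\0\0\0",
    b"\177m\0\0",
]


def as_le32(payload):
    """Decode 4 bytes as a little-endian signed 32-bit integer (assumed layout)."""
    return struct.unpack("<i", payload)[0]


# 4-byte reads become integers; anything else is shown as hex.
decoded = [as_le32(p) if len(p) == 4 else p.hex() for p in reads]
for raw, value in zip(reads, decoded):
    print(len(raw), value)
```

The first value, 3, lines up with the "read_ib_one: version 3 startup" message, and the second, 0, with "rank 0 in". The 8 and the 4 are each followed by a read of exactly that many bytes, so they look like length prefixes, but that reading is only an assumption on my part. None of these reads happen in the 0.97 trace.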
It would be great if mpiexec were to work with mvapich 0.97, as it seems
to have some nice features over 0.96/0.95. Any advice on where to poke
to fix the problem would be appreciated.
Thanks,
Jimmy.
--
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/