mpiexec svn and mvapich 0.97 failure to start jobs

Jimmy Tang jtang at tchpc.tcd.ie
Wed Mar 15 13:08:13 EST 2006


Hi,


With the announcement of mvapich 0.97 in the last day or so, I decided
that it would be nice to install and run the latest versions of mvapich
and mpiexec on our InfiniBand cluster.

    mpiexec svn checkout - 20060315
    mvapich 0.97
    our IB stack is the Voltaire-shipped ibhost-3.x
       (the current release, 3.5.0 I think)

I compiled cpi.c against mvapich 0.97 and ran it with the latest mpiexec
version, and this is what I get:


------------------------------------------------------------------------------
17:53:15 jtang at iitac128 ~/sandbox/mpiexec-tests $
/usr/support/mpiexec-svn-20060315/bin/mpiexec  -verbose  ./cpi-97
mpiexec: resolve_exe: using absolute exe "./cpi-97".
connect: Connection refused
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
mpiexec: kill_tasks: killing all tasks.
mpiexec: Warning: task 0 exited with status 1.
------------------------------------------------------------------------------


This version of mpiexec works with the older 0.95 version of mvapich:


------------------------------------------------------------------------------
17:56:49 jtang at iitac128 ~/sandbox/mpiexec-tests $
/usr/support/mpiexec-svn-20060315/bin/mpiexec  -verbose  ./cpi-95
mpiexec: resolve_exe: using absolute exe "./cpi-95".
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: read_ib_one: version 3 startup.
mpiexec: service_ib_startup: rank 0 in, 0 + 0 left.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 0 to accept, 0 to read.
mpiexec: read_ib_startup_ports: barrier start.
mpiexec: read_ib_startup_ports: barrier done.
mpiexec: wait_tasks: waiting for iitac128.ib.tchpc.tcd.ie.
Process 0 on iitac128.tchpc.tcd.ie
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000082
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 0.
------------------------------------------------------------------------------
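
A quick way to see which startup steps never ran in the 0.97 case is to
pull the function names out of the two verbose logs and compare them. The
sketch below does that in Python. The log text is copied from the two runs
above with the wrapped lines rejoined; the helper names (steps, missing)
are made up for illustration and are not part of mpiexec.

```python
import re

# Verbose output from the failing 0.97 run (wrapped lines rejoined).
LOG_097 = """\
mpiexec: resolve_exe: using absolute exe "./cpi-97".
connect: Connection refused
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 1 to accept, 0 to read.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 1.
mpiexec: kill_tasks: killing all tasks.
mpiexec: Warning: task 0 exited with status 1.
"""

# Verbose output from the working 0.95 run (program output omitted).
LOG_095 = """\
mpiexec: resolve_exe: using absolute exe "./cpi-95".
mpiexec: process_start_event: evt 2 task 0 on iitac128.ib.tchpc.tcd.ie.
mpiexec: read_ib_one: version 3 startup.
mpiexec: service_ib_startup: rank 0 in, 0 + 0 left.
mpiexec: All 1 task (spawn 0) started.
mpiexec: read_ib_startup_ports: waiting for checkin: 0 to accept, 0 to read.
mpiexec: read_ib_startup_ports: barrier start.
mpiexec: read_ib_startup_ports: barrier done.
mpiexec: wait_tasks: waiting for iitac128.ib.tchpc.tcd.ie.
mpiexec: process_obit_event: evt 3 task 0 on iitac128.ib.tchpc.tcd.ie stat 0.
"""

# Lines of the form "mpiexec: some_function: ..."; plain messages such as
# "mpiexec: Warning: ..." or "mpiexec: All 1 task ..." do not match.
STEP = re.compile(r"^mpiexec: ([a-z_]+):")


def steps(log):
    """Return the function names in a verbose log, in order of first use."""
    seen = []
    for line in log.splitlines():
        m = STEP.match(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen


def missing(good, bad):
    """Return the steps that appear in the good log but not in the bad one."""
    bad_steps = set(steps(bad))
    return [s for s in steps(good) if s not in bad_steps]


if __name__ == "__main__":
    print(missing(LOG_095, LOG_097))
```

On these two logs the steps that only show up with 0.95 are read_ib_one,
service_ib_startup and wait_tasks, which suggests the 0.97 binary never
checks in on the IB startup socket before it exits.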

Job startup was broken with mvapich 0.96 as well. Anyway, I did an strace
of the two different versions of cpi (each against the relevant mvapich);
I've only attached what I thought might be useful to the developers/bug
fixers.


Here's the tail end of the output from the failed job startup with 0.97:
------------------------------------------------------------------------------
setsockopt(0, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
connect(0, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(0, "+2+12+2928345.login01.ib.tchpc.t"..., 88) = 88
select(1024, [0], NULL, NULL, {0, 0})   = 1 (in [0], left {0, 0})
fcntl(0, F_GETFL)                       = 0x2 (flags O_RDWR|O_LARGEFILE)
read(0, "+2+1+0+3+1", 65536)            = 10
close(0)                                = 0
write(2, "mpiexec: ", 9)                = 9
write(2, "process_obit_event: evt 3 task 0"..., 67) = 67
write(2, ".\n", 2)                      = 2
close(4)                                = 0
select(9, [8], NULL, NULL, {0, 0})      = 0 (Timeout)
nanosleep({0, 200000000}, NULL)         = 0
write(2, "mpiexec: ", 9)                = 9
write(2, "kill_tasks: killing all tasks", 29) = 29
write(2, ".\n", 2)                      = 2
kill(28037, SIGTERM)                    = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(28037, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 28037
write(2, "mpiexec: Warning: ", 18)      = 18
write(2, "task 0 exited with status 1", 27) = 27
write(2, ".\n", 2)                      = 2
rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, NULL, 8) = 0
close(3)                                = 0
unlink("/tmp/mpiexec-sock/jtang/28345.iitac128.tchpc.tcd.ie") = 0
rmdir("/tmp/mpiexec-sock/jtang")        = 0
rmdir("/tmp/mpiexec-sock")              = 0
exit_group(0x1, 0x1, 0x2a958a7530, 0x2a958a8e08, 0x2a958ab090
<unfinished ... exit status 1>
------------------------------------------------------------------------------
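
The "+2+1+0+3+1" string read back on fd 0 looks like a run of sign-prefixed
integers. The sketch below decodes such a string, assuming it is the
DIS-style integer encoding used on the PBS task manager connection; the
post itself does not confirm that, and the function names are made up for
illustration. In that encoding a sign is followed by the digits, and a
number with more than one digit is preceded by its digit count.

```python
def read_dis_int(buf, pos=0, count=1):
    """Decode one sign-prefixed integer from buf starting at pos.

    Returns (value, next_pos). A leading '+' or '-' is followed by `count`
    digits; a leading digit instead starts a `count`-digit length prefix
    that says how many digits the following number has.
    """
    if pos >= len(buf):
        raise ValueError("unexpected end of input")
    ch = buf[pos]
    if ch in "+-":
        digits = buf[pos + 1 : pos + 1 + count]
        if len(digits) != count or not digits.isdigit():
            raise ValueError("bad digits at offset %d" % pos)
        value = int(digits)
        return (-value if ch == "-" else value), pos + 1 + count
    if ch.isdigit():
        # This is a digit-count prefix; decode it, then read the number.
        new_count = int(buf[pos : pos + count])
        return read_dis_int(buf, pos + count, new_count)
    raise ValueError("unexpected character %r at offset %d" % (ch, pos))


def decode_all(buf):
    """Decode a string made up only of consecutive encoded integers."""
    values, pos = [], 0
    while pos < len(buf):
        value, pos = read_dis_int(buf, pos)
        values.append(value)
    return values


if __name__ == "__main__":
    # Reply read on fd 0 in the failing trace above.
    print(decode_all("+2+1+0+3+1"))
```

Under that assumption the reply decodes to 2, 1, 0, 3, 1. The last two
values line up with "evt 3" and "stat 1" in the verbose log, although what
each field means is a guess on my part.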


And here's the output from a successful job startup with 0.95:
------------------------------------------------------------------------------
setsockopt(0, SOL_SOCKET, SO_LINGER, {onoff=1, linger=5}, 8) = 0
connect(0, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(0, "+2+12+2928345.login01.ib.tchpc.t"..., 88) = 88
select(1024, [0], NULL, NULL, {0, 0})   = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0})      = 0 (Timeout)
accept(4, 0, NULL)                      = 5
fcntl(5, F_GETFL)                       = 0x2 (flags O_RDWR|O_LARGEFILE)
fcntl(5, F_SETFL, O_RDWR)               = 0
poll([{fd=5, events=POLLIN, revents=POLLIN}], 1, 0) = 1
read(5, "\3\0\0\0", 4)                  = 4
read(5, "\0\0\0\0", 4)                  = 4
read(5, "\10\0\0\0", 4)                 = 4
write(2, "mpiexec: ", 9)                = 9
write(2, "read_ib_one: version 3 startup", 30) = 30
write(2, ".\n", 2)                      = 2
read(5, "\313\0\0\0\206\342r\372", 8)   = 8
read(5, "\4\0\0\0", 4)                  = 4
read(5, "\177m\0\0", 4)                 = 4
write(2, "mpiexec: ", 9)                = 9
write(2, "service_ib_startup: rank 0 in, 0"..., 41) = 41
write(2, ".\n", 2)                      = 2
select(1024, [0], NULL, NULL, {0, 0})   = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0})      = 0 (Timeout)
accept(4, 0, NULL)                      = -1 EAGAIN (Resource temporarily unavailable)
poll(0x54c500, 0, 0)                    = 0
select(1024, [0], NULL, NULL, {0, 0})   = 0 (Timeout)
select(9, [8], NULL, NULL, {0, 0})      = 0 (Timeout)
nanosleep({0, 200000000}, NULL)         = 0
write(2, "mpiexec: ", 9)                = 9
write(2, "All 1 task (spawn 0) started", 28) = 28
write(2, ".\n", 2)                      = 2
write(2, "mpiexec: ", 9)                = 9
write(2, "read_ib_startup_ports: waiting f"..., 66) = 66
write(2, ".\n", 2)                      = 2
fcntl(4, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE)
fcntl(4, F_SETFL, O_RDWR)               = 0
close(4)                                = 0
write(8, "\5\0\0\0", 4)                 = 4
------------------------------------------------------------------------------
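
The read(5, ...) calls in the working trace are the check-in that never
happens with 0.97. To make the raw bytes easier to read, the sketch below
unpacks them as 32-bit integers. It assumes little-endian byte order (the
host looks like x86-64, but that is an assumption), and the helper name
as_le_ints is made up for illustration.

```python
import struct

# Byte strings copied from the read(5, ...) calls in the 0.95 trace.
# strace prints octal escapes, which Python bytes literals also accept.
READS = [
    b"\3\0\0\0",
    b"\0\0\0\0",
    b"\10\0\0\0",
    b"\313\0\0\0\206\342r\372",
    b"\4\0\0\0",
    b"\177m\0\0",
]


def as_le_ints(chunk):
    """Split a chunk into 32-bit little-endian unsigned integers."""
    if len(chunk) % 4:
        raise ValueError("chunk length is not a multiple of 4")
    return list(struct.unpack("<%dI" % (len(chunk) // 4), chunk))


if __name__ == "__main__":
    for chunk in READS:
        print(len(chunk), as_le_ints(chunk))
```

The first value is 3, which matches the "version 3 startup" message, and
the second is 0, which is presumably the rank. The 8 and the 4 are each
followed by a read of exactly that many bytes, so they are probably length
fields, but the meaning of the payloads is not something the trace alone
can confirm.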


It would be great if mpiexec were to work with mvapich 0.97, as it seems
to have some nice features over 0.96/0.95. Any advice on where to poke
to fix the problem would be appreciated.


Thanks,
Jimmy.

-- 
Jimmy Tang
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin.
http://www.tchpc.tcd.ie/

