launching GM jobs is too slow

Garrick Staples garrick at usc.edu
Thu Nov 10 22:02:24 EST 2005


I've been investigating why I'm having a hard time launching
mpichgm-1.2.6..14a jobs with more than 500 CPUs.  It seems that actually
booting MPI is taking too long.

With the mpichgm mpirun perl script using ssh1, I can launch a 1000 CPU
job in a few seconds.  A simple MPI helloworld will launch and complete
in 9 seconds.

Using strace and some strategic printf's tells me that mpiexec spends a
lot of time waiting on the read() inside of gm.c:read_gm_startup_ports().
The accept() calls return just fine, but the read() calls can take
several seconds:

     0.000000 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
     2.609772 read(6, "<", 1)           = 1
...
     0.000000 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
     7.619335 read(6, "<", 1)           = 1
...
     0.000000 nanosleep({0, 200000000}, NULL) = 0
     0.209981 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
    19.578293 read(6, "<", 1)           = 1
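
To pull the stalls out of a longer trace, a small filter along these lines could help. It is only a sketch: it assumes the leading number on each line is a relative timestamp in seconds, as in the excerpt above, and the names slow_calls and SAMPLE are made up for illustration.

```python
import re

# Matches the start of lines such as:
#      2.609772 read(6, "<", 1)           = 1
# Assumption: the first field is a relative timestamp in seconds.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\w+)\(")

def slow_calls(trace_lines, threshold=1.0):
    """Return (delta_seconds, syscall_name) for every line whose leading
    timestamp is at or above the threshold, in trace order."""
    hits = []
    for line in trace_lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip "..." separators and anything unparsed
        delta = float(m.group(1))
        if delta >= threshold:
            hits.append((delta, m.group(2)))
    return hits

# Lines copied from the trace excerpt in this message.
SAMPLE = '''\
     0.000000 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
     2.609772 read(6, "<", 1)           = 1
...
     0.000000 nanosleep({0, 200000000}, NULL) = 0
     0.209981 accept(4, 0, NULL)        = 6
     0.000000 read(6, "<", 1)           = 1
    19.578293 read(6, "<", 1)           = 1
'''.splitlines()

if __name__ == "__main__":
    for delta, name in slow_calls(SAMPLE):
        print(f"{delta:10.6f}s  {name}()")
```

With the default one-second threshold, only the two delayed read() lines from the sample are reported; the accept() and nanosleep() lines fall below it.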


The thing that really baffles me is that mpiexec appears to do almost
the same thing as the mpichgm perl script, but for some reason the perl
script never gets blocked.

I've been futzing with the mpiexec code, trying random things like using
a blocking socket, using recv() instead of read(), etc.  But nothing seems
to cure those slow read() calls from the network.
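
One general socket property may be relevant here, offered as a possible reading of the trace rather than a diagnosis of mpiexec: accept() returns as soon as the kernel has completed the TCP handshake, while a read on the new socket blocks until the peer process actually writes something. The localhost sketch below reproduces that pattern with a deliberately slow sender. The names slow_client and measure are hypothetical, and nothing here is taken from the mpiexec or mpichgm sources.

```python
import socket
import threading
import time

def slow_client(port, delay):
    """Connect immediately, then wait `delay` seconds before sending one byte."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        time.sleep(delay)
        s.sendall(b"<")

def measure(delay=0.5):
    """Return (accept_seconds, read_seconds) for a single slow client."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        srv.listen(1)
        port = srv.getsockname()[1]
        t = threading.Thread(target=slow_client, args=(port, delay))
        t.start()

        t0 = time.monotonic()
        conn, _ = srv.accept()       # returns once the connection is established
        t1 = time.monotonic()
        data = conn.recv(1)          # blocks until the peer actually writes
        t2 = time.monotonic()
        conn.close()
        t.join()
    assert data == b"<"
    return t1 - t0, t2 - t1

if __name__ == "__main__":
    accept_s, read_s = measure()
    print(f"accept: {accept_s:.3f}s  read: {read_s:.3f}s")
```

If that is what is happening, the wait is set by when the remote process gets around to writing, so switching the receiving side between read() and recv(), or between blocking and non-blocking sockets, would not be expected to shorten it.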


Some actual timings with the latest TORQUE-2.0.0p2, RHEL3 x86_64,
mpichgm-1.2.6..14a, and mpiexec-0.80 on 480 CPUs:

A few hundred trivial TM tasks are fine:
$ time mpiexec -comm none /bin/true
real    0m20.888s

$ time pbsdsh /bin/true
real    0m21.168s


A few hundred TM tasks that live for a while are fine:
$ time mpiexec -comm none bash -c "\"sleep 60;true\""
real    1m21.453s

$ time pbsdsh bash -c "sleep 60;true"
real    1m21.363s


But an actual GM MPI job is a problem:
$ time mpirun -allcpus ./helloworld
real    0m7.319s

$ time mpiexec ./mpitest/helloworld
real    1m4.664s
... past 500 CPUs I tend to just get "<<<...>>> string not recognized"
(which is <<<ABORT_magic_ABORT>>> from a node that timed out)

$ time mpiexec -v ./mpitest/helloworld
mpiexec: resolve_exe: using absolute exe "./mpitest/helloworld".
...
mpiexec: All 480 tasks started.
read_gm_startup_ports: waiting for info
read_gm_startup_ports: mpich gm version 12510
read_gm_startup_ports: id 1 port 2 board 96 gm_node_id 0xdd4832d8
  numanode 0 pid 30181 remote_port 14249
... this part takes over a minute, just reading in the node info

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California