mpiexec scalability improved!

garrick garrick at usc.edu
Wed Apr 12 20:38:45 EDT 2006


On Wed, Apr 12, 2006 at 03:24:44PM -0700, garrick alleged:
> On Wed, Apr 12, 2006 at 12:49:53PM -0400, Pete Wyckoff alleged:
> > pw at osc.edu wrote on Tue, 11 Apr 2006 10:29 -0400:
> > > Hang onto your patch.  I'll take a crack at converting gm.c to
> > > do periodic servicing without fork and you can see how you like
> > > that.
> > 
> > Are you willing to test my vision for GM async?  Here's a patch.
> > It works here on 4 GM nodes on ia64, and the debug statements
> > appear to show it's doing the right things, but you may run
> > into issues at scale.  I am curious to know if it is as fast
> > as your fork() version or the mpich-gm perl script.
> 
> It does the job in 10-20 seconds, but still failed once in about 10
> runs.

This is definitely failing regularly.  With -v -v, about all I get is
this:

mpiexec: wait_task_start: nwse 76 nws 0 ret 0.
mpiexec: All 1800 tasks (spawn 0) started.
mpiexec: read_gm_startup_ports: waiting for checkin: 3 to accept, 0 to read.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: no new task, now accept wait 3.
mpiexec: service_gm_startup: accepted fd 6, accept wait 2.
mpiexec: service_gm_startup: reading fd 6, read wait 0.
mpiexec: read_gm_one: rank 1539 in, 2 + 0 left.
mpiexec: service_gm_startup: no new task, now accept wait 2.
mpiexec: service_gm_startup: no new task, now accept wait 2.
mpiexec: service_gm_startup: accepted fd 6, accept wait 1.
mpiexec: service_gm_startup: reading fd 6, read wait 0.
mpiexec: read_gm_one: rank 1411 in, 1 + 0 left.
mpiexec: service_gm_startup: no new task, now accept wait 1.
mpiexec: service_gm_startup: accepted fd 6, accept wait 0.
mpiexec: service_gm_startup: reading fd 6, read wait 0.
mpiexec: read_gm_one: rank 1559 in, 0 + 0 left.
mpiexec: wait_tasks: waiting for hpc0636 hpc0636 hpc0636 and 1797 others.
mpiexec: listen_abort_fd: parent says via index 0 to listen to abort fd 4.
mpiexec: Error: do_child: input on unexpected fd 10.
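
For anyone following the thread, here is a rough sketch of what
"periodic servicing without fork" amounts to in the log above: poll the
listening socket and any accepted connections without blocking, accept
or read whatever is ready, and return to the main loop.  This is not
the gm.c patch.  It is written in Python for brevity, and the names
service_startup, pending, and accept_wait are hypothetical, chosen only
to echo the "accept wait" counter in the debug output.  It also assumes
each task's checkin message fits in a single recv(), which real code
could not rely on.

```python
import select
import socket


def service_startup(listener, pending, accept_wait, timeout=0.0):
    """Service the startup sockets once, without forking.

    listener    -- listening socket that tasks connect to (assumption)
    pending     -- list of accepted sockets not yet read; modified in place
    accept_wait -- number of tasks that still have to connect
    timeout     -- 0.0 means pure polling; the call returns immediately

    Returns (accept_wait, messages), where messages holds the raw bytes
    read from every connection that was ready during this pass.
    """
    messages = []
    # Ask the kernel which sockets are ready right now.
    readable, _, _ = select.select([listener] + pending, [], [], timeout)
    for sock in readable:
        if sock is listener:
            # A task has connected: accept it and read it on a later pass.
            conn, _addr = listener.accept()
            pending.append(conn)
            accept_wait -= 1
        else:
            # An accepted connection has data: read it and retire it.
            # Simplification: one recv() is assumed to hold the message.
            messages.append(sock.recv(4096))
            pending.remove(sock)
            sock.close()
    return accept_wait, messages
```

The caller would invoke this from its existing event loop between
other work and stop once accept_wait reaches zero and pending is empty.
With timeout left at 0.0 a pass that finds nothing ready returns at
once, which is the case the repeated "no new task" lines correspond to.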



-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California