mpiexec scalability improved!

garrick garrick at usc.edu
Wed Apr 12 22:42:54 EDT 2006


On Wed, Apr 12, 2006 at 05:38:45PM -0700, garrick alleged:
> On Wed, Apr 12, 2006 at 03:24:44PM -0700, garrick alleged:
> > On Wed, Apr 12, 2006 at 12:49:53PM -0400, Pete Wyckoff alleged:
> > > pw at osc.edu wrote on Tue, 11 Apr 2006 10:29 -0400:
> > > > Hang onto your patch.  I'll take a crack at converting gm.c to
> > > > do periodic servicing without fork and you can see how you like
> > > > that.
> > > 
> > > Are you willing to test my vision for GM async?  Here's a patch.
> > > It works here on 4 GM nodes on ia64, and the debug statements
> > > appear to show it's doing the right things, but you may run
> > > into issues at scale.  I am curious to know if it is as fast
> > > as your fork() version or the mpich-gm perl script.
> > 
> > It does the job in 10-20 seconds, but still failed once in about 10
> > runs.
> 
> This is definitely failing regularly.  With -v -v, about all I get is
> this:
> mpiexec: Error: do_child: input on unexpected fd 10.

fd 10 is the pipe from the parent, and the handler isn't clearing it
from the read fd list, so it presumably still looks ready when the
remaining fds are checked.

One-line patch to do_child() and it is behaving very nicely:
    /* message from parent */
    if (poll_isset(pipe_with_stdio, &rfsd)) {
        poll_clr(pipe_with_stdio, &rfsd);  /* <-- this line was missing */
        --n;
        stdio_msg_listener_read();
    }
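
For anyone who wants to see the failure mode in isolation, here is a
minimal sketch using plain select() and the FD_* macros as stand-ins for
mpiexec's poll_isset()/poll_clr() wrappers. The names service_ready() and
run_once() are made up for illustration and are not from the mpiexec
source; the leftover scan is my assumption about how the "unexpected fd"
message comes about.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

/*
 * Hypothetical dispatch step: service the one descriptor we know about,
 * then count every descriptor that is still marked ready. Returns the
 * number of such leftover ("unexpected") descriptors.
 */
static int service_ready(int parent_fd, int maxfd, fd_set *rfds,
                         int clear_handled)
{
    char buf[64];
    int fd, leftover = 0;

    assert(rfds != NULL);
    if (FD_ISSET(parent_fd, rfds)) {
        if (clear_handled)
            FD_CLR(parent_fd, rfds);   /* the step the patch adds */
        if (read(parent_fd, buf, sizeof(buf)) < 0)
            perror("read");
    }
    /* anything still set was not claimed by a handler */
    for (fd = 0; fd <= maxfd; fd++) {
        if (FD_ISSET(fd, rfds)) {
            fprintf(stderr, "input on unexpected fd %d\n", fd);
            leftover++;
        }
    }
    return leftover;
}

/* Make a pipe, put one byte in it, select() on it, then dispatch.
 * Returns the leftover count, or -1 on a setup failure. */
static int run_once(int clear_handled)
{
    int p[2], n, leftover;
    fd_set rfds;
    struct timeval tv = { 1, 0 };

    if (pipe(p) < 0)
        return -1;
    if (write(p[1], "x", 1) != 1)
        return -1;
    FD_ZERO(&rfds);
    FD_SET(p[0], &rfds);
    n = select(p[0] + 1, &rfds, NULL, NULL, &tv);
    if (n < 0)
        return -1;
    leftover = service_ready(p[0], p[0], &rfds, clear_handled);
    close(p[0]);
    close(p[1]);
    return leftover;
}
```

Calling run_once(0) from a small main() should report the pipe's read end
as unexpected even though it was already serviced, while run_once(1)
should report nothing.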

I've run dozens of tests on 1800 nodes without any failures.

Will we see this in a release or svn soon?

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California