Mpiexec release 0.73, fixes and odd enhancements

Pete Wyckoff pw at osc.edu
Fri Mar 21 09:32:38 EST 2003


Changes from 0.71 are a mixed bag, although nothing is critical enough
to demand an upgrade.  (Version 0.72 had only minor changes and did not
warrant a mailing list announcement.)

Kill all tasks on abnormal exit:

    When any task exits via a signal, such as SEGV, or exits with a
    non-zero exit code, e.g. using STOP 99 or exit(1), mpiexec will
    kill all other tasks in the parallel process.  It actually waits
    5 seconds before doing the kill to avoid false positives.  If a
    task exits returning 0, mpiexec merrily waits until all the others
    finish too, unless the "-kill" command-line option was used.

Return task 0 exit value:

    The return value of mpiexec to its environment is now the same as
    the return value of task 0 of the parallel process.  This makes it
    possible to track success/failure of the code itself, as opposed to
    mpiexec, in scripts that run parallel jobs.
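
    As a rough sketch of how a job script might take advantage of this,
    the wrapper below runs whatever command it is given and branches on
    its exit status.  The name "run_job" and the messages are made up for
    illustration; nothing here is part of mpiexec itself.

```shell
# run_job is a hypothetical helper, not something shipped with mpiexec.
# It runs the command passed as its arguments and reports on the exit
# status, which for mpiexec 0.73 is the exit value of task 0.
run_job() {
    if "$@"; then
        echo "job succeeded"
        return 0
    else
        status=$?    # exit status of the command that just failed
        echo "job failed with exit code $status"
        return "$status"
    fi
}

# Intended use inside a batch script (not executed here):
#   run_job mpiexec mycode || exit 1
```

    Passing the status back out of the wrapper lets the enclosing script,
    or the batch system, see the same value.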

Fixed hang during initialization (MPICH/GM and MPICH/P4):

    It was possible for process initialization to fail such that mpiexec
    would hang rather than recognize that processes had exited.  One
    example case that triggered this is trying to run a GM code on a
    node without a Myrinet card.  This is now fixed for both MPICH/GM
    and MPICH/P4.

Transform hostnames for message passing library (MPICH/P4):

    The old option "-gige" was a hack used to convince the MPICH/P4
    library to use a different ethernet interface for message passing
    traffic.  This has been generalized to support arbitrary
    transformations that convert from the hostnames that PBS uses
    to the hostnames to be used by the MPICH/P4 library for sending
    application data.

    For example, suppose your cluster nodes are named amd001, amd002, ...
    and each one also has a gigabit interface whose address corresponds
    to the names gige001, gige002, ... in /etc/hosts or DNS.  This line
    starts a code that uses the gigabit ethernet network, hopefully for
    better performance:

	mpiexec --transform-hostname=s/amd/gige/ mycode

    At runtime mpiexec calls out to "sed" to perform the actual
    translation.  This external program can be selected at compile time
    to allow support for "perl", for example.  It is expected that users
    will not directly use this option, but that enclosing scripts or
    generalized code launchers may find it handy.  Suggestions for
    improvements are always welcome.
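
    To check a transformation before handing it to mpiexec, the same
    expression can be fed through "sed" by hand.  The helper below is
    only a sketch with an invented name ("preview_transform"); it mimics
    the effect of the translation and makes no claim about how mpiexec
    invokes sed internally.

```shell
# preview_transform is a hypothetical helper for trying out a sed
# expression on a list of hostnames.  It is not part of mpiexec.
# Usage: preview_transform SED_EXPRESSION HOST...
preview_transform() {
    expr=$1
    shift
    for host in "$@"; do
        # Run each PBS-style hostname through sed and show the result.
        printf '%s -> %s\n' "$host" "$(printf '%s\n' "$host" | sed -e "$expr")"
    done
}

preview_transform 's/amd/gige/' amd001 amd002
# prints:
#   amd001 -> gige001
#   amd002 -> gige002
```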

Alternate communication device specifier:

    An environment variable MPIEXEC_COMM can be used to pass the value
    of "-comm" to specify which communication library to use.  This is
    for convenience and to support a batch environment which may want to
    choose the device for the user depending on the configuration of the
    node on which the job was scheduled.
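
    A batch prologue could set the variable along these lines.  This is
    a sketch only: the helper name "choose_comm", the probe path, and the
    device values in the usage comment are all placeholders, so check the
    mpiexec documentation for the values that "-comm" actually accepts
    at your site.

```shell
# choose_comm is a hypothetical helper, not part of mpiexec.
# Usage: choose_comm PROBE_PATH VALUE_IF_PRESENT VALUE_IF_ABSENT
# Prints the first value if PROBE_PATH exists on this node, else the second.
choose_comm() {
    if [ -e "$1" ]; then
        echo "$2"
    else
        echo "$3"
    fi
}

# Intended use in a job prologue (path and values are assumptions,
# not executed here):
#   MPIEXEC_COMM=$(choose_comm /path/to/myrinet/device gm-value p4-value)
#   export MPIEXEC_COMM
#   mpiexec mycode
```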

Removed GM-specific options (MPICH/GM):

    The GM-specific option "-no-shmem" is gone, but documentation was
    added to suggest how to use environment variables to control the
    behavior of MPICH/GM and MPICH/P4 instead.  The quick translation
    for that one is:

	GMPI_SHMEM=0 mpiexec mycode

Fixed shmem for cluster nodes (MPICH/SHMEM):

    The SHMEM device now supports non-time-shared SMP nodes.  On
    a space-shared cluster you could request "-l nodes=1:ppn=2",
    for example, and use an MPI/SHMEM code to run a two-process
    job within the node.

More architecture support:

    Mpiexec compiles on Darwin (Apple/Mac/OSX).  If you are not
    using the UFS file system be prepared for some fun issues trying to
    compile PBS, though.  :)


Full changelog and downloads at:  http://www.osc.edu/~pw/mpiexec/
Do respond to the list with bug reports, comments, suggestions,
and rants.

		-- Pete


