additional command line switches / extensions for mpiexec?
Pete Wyckoff
pw at osc.edu
Thu Nov 23 20:26:12 EST 2006
thomas.zeiser at rrze.uni-erlangen.de wrote on Thu, 23 Nov 2006 10:49 +0100:
> some ideas for additional command line switches / extensions for
> mpiexec came to my mind and I'd like to hear if they would be of
> general interest.
>
> 1) mpiexec already has "-pernode" but thinking of n-way nodes with
> dual-core CPUs, a switch like "-Npernode <n>" might be very useful
> (and probably easy to implement, i.e. in get_hosts.c one probably
> only would have to set nodes[i].availcpu to the correct n)
This sounds like a good suggestion, and pretty easy to implement
in constrain_nodes() along with how -pernode is implemented. I'll
stick it in the tree if you code it up (with manpage entry too).
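A rough sketch of what such a cap might look like, under loud assumptions: struct node here is a simplified stand-in and not the real structure in the mpiexec source (only the availcpu field is named in this thread), and cap_cpus_per_node() is a made-up name for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mpiexec's per-node record. Only availcpu
 * is mentioned in the thread; everything else is assumed. */
struct node {
    const char *name;
    int availcpu;   /* CPUs still usable for task placement */
};

/* Limit every node to at most n usable CPUs, as "-Npernode <n>" might.
 * Nodes that already have fewer than n CPUs are left unchanged.
 * Returns the total number of usable CPUs after the cap. */
static int cap_cpus_per_node(struct node *nodes, size_t numnodes, int n)
{
    int total = 0;

    assert(n > 0);
    for (size_t i = 0; i < numnodes; i++) {
        if (nodes[i].availcpu > n)
            nodes[i].availcpu = n;
        total += nodes[i].availcpu;
    }
    return total;
}
```

With n = 1 this would behave like the existing -pernode, assuming -pernode also works by limiting the available CPUs on each node.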
> 2) mpiexec currently does a block distribution of processes (i.e.
> first fill one node before going to the next one); for some
> applications a round-robin assignment could improve performance.
> This round-robin assignment of course can be done with the help of
> a config file given to mpiexec but if it is of interest for more
> people, a command line switch might be easier to use.
There was one other person who suggested this, off-list. He said
something like:
With small smps and a fat tree there are only two or three task
orderings that are interesting. On big SMPs or more hierarchical
switches task placement is a rich and fruitful undertaking.
Maybe having a means to do this in mpiexec would be of general
use?
with environment variables something like
MPIEXEC_TASK_PLACEMENT=PERM|SORT_HOST|SORT_PBS
MPIEXEC_TASK_PERMUTATION={(0,3), (1,5), (2,4)}
The IBM SP includes a fairly complete set of task placement
options and our users make the best of them quite often.
to which I replied (edited a bit):
I agree in principle but the complexity is making my head hurt.
I think supporting a couple basic ones makes sense. If you have
some good ones from SP-land that I'm not familiar with, those
are likely good too. SORT_PBS is obvious: whatever the
scheduler said. [..] You might be better than me at picking good
names for all this; hopefully something standard that everyone
will understand.
There are two spots in the code that want to be changed. One in each
of parse_config() (the -n section, not the fnmatch section) and
argcv_config() where they assign a task to the next free nodes[] entry.
Maybe abstract that out into a new function next_free_node() that would
hand an index into the nodes[] array for one to allocate. You could
have whatever static state variables that would figure out how to do
that allocation. If you stick all this in a new file it would be nice
and isolated and you could have an init function to set things up and
parse the env or command-line vars too.
I'm still somewhat uninterested, but if you can do just round-robin
cleanly, go ahead. Note that it has to be general: if the
allocation does not have the same number of CPUs on each node, it
should get close to round-robin, but not dump core.
And you can do all this today with the --config file mechanism, but
users may find that cumbersome.
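To make the next_free_node() idea concrete, here is a minimal sketch, not the actual mpiexec code. The struct layout, the "used" counter, placement_init() and the PLACE_* names are all assumptions made up for illustration. It keeps its state in static variables and copes with nodes that have different CPU counts by skipping any node that is already full.

```c
#include <assert.h>

/* Hypothetical, simplified node record (not the real mpiexec struct). */
struct node {
    int availcpu;   /* CPUs allocated to the job on this node */
    int used;       /* CPUs already handed out to tasks */
};

enum placement { PLACE_BLOCK, PLACE_ROUND_ROBIN };

/* Static state, set up once by placement_init(). */
static struct node *pl_nodes;
static int pl_numnodes;
static enum placement pl_mode;
static int pl_cursor;   /* where the next search starts */

static void placement_init(struct node *nodes, int numnodes,
                           enum placement mode)
{
    assert(nodes != 0 && numnodes > 0);
    pl_nodes = nodes;
    pl_numnodes = numnodes;
    pl_mode = mode;
    pl_cursor = 0;
}

/* Claim one CPU and return the index of its node in nodes[], or -1 if
 * every CPU is taken. Full nodes are skipped, so an uneven allocation
 * degrades gracefully instead of running off the end of the array. */
static int next_free_node(void)
{
    for (int tries = 0; tries < pl_numnodes; tries++) {
        int i = (pl_cursor + tries) % pl_numnodes;

        if (pl_nodes[i].used < pl_nodes[i].availcpu) {
            pl_nodes[i].used++;
            /* Block: keep filling this node. Round-robin: move on. */
            pl_cursor = (pl_mode == PLACE_ROUND_ROBIN)
                        ? (i + 1) % pl_numnodes : i;
            return i;
        }
    }
    return -1;
}
```

With CPU counts of 2, 1 and 3 on three nodes, round-robin mode hands out nodes in the order 0, 1, 2, 0, 2, 2 and block mode in the order 0, 0, 1, 2, 2, 2. Both parse_config() and argcv_config() could then call the same function.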
> 3) to further improve performance, it is sometimes
> necessary to pin processes to certain CPUs (e.g. via
> taskset/PLPA). It would be very helpful if mpiexec could set an
> environment variable which tells the "local rank" of the process on
> the current node, i.e. how many processes have been started via
> pbs_mom before (not including mpiexec shepherd processes)
Garrick has been demanding that mpiexec keep track of CPU numbers
for this reason as well. I'm a bit unclear as to what mechanism
is used to get tasks into tasksets, and what mpiexec should provide,
but want to implement the most general mechanism again.
Isn't it all PBS's job? I.e., some pbs_mom is going to spawn the
task and will have to put it into a taskset then. How can a process
environment variable be used to control pinning?
Rather than provide a zero-based local rank, I was thinking that
using PBS's idea of a virtual CPU number would be a better idea, in
case there were other processes on the node too. This is the part
after the slash in the string "node01/0+node01/1", e.g., that can be
seen in "qstat -an". Those numbers won't be 0 and 1 if someone else
is already on the node.
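For illustration, pulling those numbers out of such a string could look roughly like the sketch below. It assumes the string is exactly a '+'-separated list of "host/number" entries as in the example above, and parse_vcpus() is a hypothetical helper, not something that exists in mpiexec or PBS.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Extract the number after the slash from each "host/cpu" entry of a
 * '+'-separated list such as "node01/0+node01/1". Stores at most max
 * values in cpus[]. Returns the number of entries parsed, or -1 if the
 * string is malformed or holds more than max entries. */
static int parse_vcpus(const char *exec_host, int *cpus, int max)
{
    const char *p = exec_host;
    int count = 0;

    assert(exec_host != NULL && cpus != NULL && max > 0);
    while (*p != '\0') {
        const char *slash = strchr(p, '/');
        const char *plus = strchr(p, '+');
        char *end;
        long v;

        /* Each entry needs a non-empty host name followed by "/digits". */
        if (slash == NULL || slash == p || (plus != NULL && plus < slash))
            return -1;
        if (!isdigit((unsigned char)slash[1]))
            return -1;
        v = strtol(slash + 1, &end, 10);
        if (*end != '+' && *end != '\0')
            return -1;
        if (count == max)
            return -1;
        cpus[count++] = (int)v;

        if (*end == '+') {
            p = end + 1;
            if (*p == '\0')     /* trailing '+' with nothing after it */
                return -1;
        } else {
            p = end;
        }
    }
    return count;
}
```

For "node01/2+node01/3" this yields the virtual CPU numbers 2 and 3, which is the case where someone else already occupies CPUs 0 and 1 on the node.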
-- Pete