ch_p4 / shmem on SMP linux cluster

Frank Eisenmenger eisenmenger at fmp-berlin.de
Fri Feb 11 10:11:24 EST 2005


Hi Pete,

thanks for your reply.

> The question to ponder is, should you enable the p4 shared memory
> implementation to try to speed up communications inside a single
> multiprocessor node?  We use it here; I haven't discovered a consensus
> on the matter.

To understand how to 'enable the p4 shared memory', I'd better describe 
in detail what I've done.

I'd like mpich & my application (pmemd from the Amber8 package) to be clean 
32-bit (just like AMD's Opterons, our Intel Noconas allow for both 64- & 
32-bit applications) and to use Intel's 32-bit Fortran compiler as the 
backend for 'mpif77' & 'mpif90' (this is best for compiling 'pmemd'):

Mpich-1.2.6
-----------

export CC="gcc -m32"
export CFLAGS="-m32"
export CCFLAGS="-m32"
export CLINKER="gcc -m32"
export CCLINKER="g++ -m32"
export FC=ifort
export F90=ifort

source /usr/local/intel/bin/ifortvars.sh

./configure --with-comm=ch_p4 --with-device=ch_p4 \
  -prefix=/usr/local/mpich-32 --without-mpe -opt="-O3" -optf77="-static -tpp7"

(excluded 'mpe', since it would automatically pull in 64-bit X11 libs;
'-tpp7' is to optimize for Xeons)


Question: should I add '-with-device=ch_shmem' here ?

But I cannot have both '--with-comm=ch_p4' & '--with-comm=shared': 
I suppose only the one of these options named first in the configure command 
would be used. I'm not very familiar with 'communication' and 
'devices' and got confused here.


make & make install (as 'su')

+ put new 'machines.LINUX' to /usr/local/mpich-32/share/ with all nodes 
(important for 'mpirun' only):

node1:2
..
node24:2
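
A minimal sketch of how such a machines file could be generated instead of 
typed by hand. The function name 'gen_machines' is hypothetical; the host 
names node1..node24 and the ':2' suffix are taken from the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print one "nodeN:CPUS" line per node,
# in the format of the machines.LINUX listing above.
gen_machines() {
    count=$1    # number of nodes
    cpus=$2     # processors per node
    i=1
    while [ "$i" -le "$count" ]; do
        echo "node${i}:${cpus}"
        i=$((i + 1))
    done
}

# 24 dual-processor nodes, printed to stdout
gen_machines 24 2
```

The output can be redirected to a file and copied into place as root once 
it has been checked.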


Mpiexec-0.77
------------

./configure --disable-p4-shmem --with-default-comm=mpich-p4 \
  --with-pbs=/usr/local/encap/torque-1.1.0p0 --with-smp-size=2

NB: seems to work as well without '--with-smp-size=2' !

make

tests:

/usr/local/mpich-32/bin/mpicc -m32 -o hello hello.c
./runtests.pl

--> works out alright


source /usr/local/intel/bin/ifortvars.sh

/usr/local/mpich-32/bin/mpif77 -static-libcxa -o hello hellof.f

(NB: '-static-libcxa' helps, since 'our' nodes do not know the location 
of some of Intel's shared libs)

-> errors because of too many command line parameters


make install (as 'su')

With respect to the above, how should I 'enable the p4 shared memory' ?


 > Speaking from a user support angle, I would say no.  There is enough
 > debate on whether the mpich1/p4/shmem implementation is any faster than
 > just using mpich/p4/TCP on a node.

There's another aspect and an interesting observation when running the 
'JAC' benchmark with pmemd (its source modified according to 
http://structbio.vanderbilt.edu/archives/amber-archive/2004/1080.phtmlon):

Jobs submitted via:  qsub .. -l nodes=<nodes>:ppn=2  <command>

where: <nodes> = No. processors <np> / 2,
        <command> is ..
.. either:

<path>/mpirun -np <np> -nolocal \
<my-path>/pmemd <pmemd-input>

.. or:

<path>/mpiexec -kill -nostdin -nostdout \
<my-path>/pmemd <pmemd-input>

mpirun (as far as I could see from the PI... files created while the job is 
running) does not seem to 'care' about the $PBS_NODEFILE provided by 
PBS: i.e. in our environment, PBS, no matter if '-l nodes=..:ppn=2' or 
'..:ppn=1' (!), "offers" a list of nodes with 2 processors per node, but 
mpirun "chooses", if possible, ONE processor per NODE with the right 
total number of processors,
e.g. if submitted with '-l nodes=8:ppn=2': uses 16 nodes, with 1 
processor per node.

mpiexec "follows the suggestion" of PBS (TORQUE) and uses 2 processors 
per node,
e.g. if submitted with '-l nodes=8:ppn=2': uses 8 nodes with 2 proc. on 
each, strictly according to the contents of $PBS_NODEFILE.
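
A minimal sketch for checking what PBS actually hands to a job, assuming 
(as is usual for TORQUE) that $PBS_NODEFILE lists each host once per 
allocated processor. The function name 'count_slots' and the sample host 
names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "host: N" for every host in a node file,
# where N is the number of times the host is listed.
count_slots() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 }'
}

# Stand-in for $PBS_NODEFILE so the sketch runs outside a PBS job:
# two hosts, each listed twice (as expected with ppn=2).
nodefile=$(mktemp)
printf 'node1\nnode1\nnode2\nnode2\n' > "$nodefile"

count_slots "$nodefile"
```

Inside a job script the call would be 'count_slots "$PBS_NODEFILE"'; 
comparing its output with the hosts the processes really run on shows 
whether the launcher followed the node file.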

The 'JAC' benchmark simulates 1000 ps of molecular dynamics 
(http://amber.scripps.edu/amber8.bench1.html) for a model system.
The following table gives the seconds of 'clean' computing time (i.e. 
without setup time) necessary for this benchmark:

No. processors    mpirun    mpiexec

      2             282       298
      4             147       156
      8              77        82
     16              41        46
     32              28        27
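
To make the scaling easier to compare, the timings can be turned into 
speedups relative to the 2-processor run of each launcher. This is only a 
sketch: the function name 'speedups' is hypothetical, and the numbers are 
copied from the timings above (processors, mpirun sec., mpiexec sec.).

```shell
#!/bin/sh
# Hypothetical helper: print, per processor count, the speedup of
# mpirun and mpiexec relative to their own 2-processor timing.
speedups() {
    awk 'NR == 1 { base_run = $2; base_exec = $3 }
         { printf "%2d  %6.2f  %6.2f\n", $1, base_run / $2, base_exec / $3 }' <<'EOF'
2 282 298
4 147 156
8 77 82
16 41 46
32 28 27
EOF
}

# Columns: processors, mpirun speedup, mpiexec speedup
speedups
```

A speedup of 16 at 32 processors would be ideal scaling from the 
2-processor baseline; both launchers stay well below that.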

From this, one might conclude (?) that, since the 2 processors on each 
node "compete" for shared memory, as is the case with mpiexec, computing 
time goes up.

I am not very familiar with how PBS/Torque was set up by the vendor of 
our cluster and how that all works together with mpich/mpiexec. Any 
suggestion would be appreciated !


Frank Eisenmenger.

-- 
Dr. Frank Eisenmenger
Forschungsinstitut für Molekulare Pharmakologie
Abt. NMR-unterstützte Strukturforschung
Tel.    +49/0-30-94793-278
FAX     +49/0-30-94793-169
Web     www.fmp-berlin.de/NMR
E-Mail  eisenmenger at fmp-berlin.de



