ch_p4 / shmem on SMP linux cluster
Pete Wyckoff
pw at osc.edu
Sat Feb 12 12:32:31 EST 2005
eisenmenger at fmp-berlin.de wrote on Fri, 11 Feb 2005 16:11 +0100:
> ./configure --with-comm=ch_p4 --with-device=ch_p4
> -prefix=/usr/local/mpich-32 --without-mpe -opt="-O3" -optf77="-static -tpp7"
[..]
> Question: should I add '-with-device=ch_shmem' here ?
>
> But, I cannot have both, '--with-comm=ch_p4' & '--with-comm=shared' -
> only the one of these options named first in the configure-command would
> be used, I suppose. I'm not very familiar with 'communication' and
> 'devices' and got confused here.
See the section in the mpiexec README around the description for
"--disable-p4-shmem". I'm suggesting you may want to add
"--with-comm=shared" to that mpich configure line above to use the shmem
implementation _inside_ the p4 device implementation. There is another
shmem implementation not related to p4 that one gets with
"--with-device=shmem". Confusing mpich, I admit. Not my fault. :)
Just for clarity, here's a subset of what I used last time to build our
p4 version of mpich. You obviously need other compilers and options,
etc.:
    opt=-O3
    fsdefs=
    prefix=/usr/local/mpich-1.2.6-p4
    export CC=icc CFLAGS="$opt $fsdefs"
    export CXX=$CC CXXFLAGS="$opt"
    export FC=ifort FFLAGS="$opt"
    export F90=$FC F90FLAGS=$FFLAGS
    export RSHCOMMAND=/usr/bin/rsh  # avoid rsh test that fails on nfs4
    ./configure --prefix=$prefix \
        --with-device=ch_p4:--socksize=65536 \
        --with-comm=shared \
        --enable-sharedlib
> Mpiexec-0.77
> ------------
>
> ./configure --disable-p4-shmem --with-default-comm=mpich-p4
> --with-pbs=/usr/local/encap/torque-1.1.0p0 --with-smp-size=2
>
> NB: seems to work as well without '--with-smp-size=2' !
Indeed! I keep wondering where I have left that in the
documentation; it's an old option that is quietly accepted but ignored.
If you do "--with-comm=shared" for mpich, then you must change the
mpiexec configure option from "--disable-p4-shmem" to "--enable-p4-shmem".
[..]
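As a sketch of the mpiexec side of that pairing, the configure line might look like the following. It reuses only the options quoted above, leaves out the ignored "--with-smp-size", and keeps the TORQUE path from the quoted message, which is an assumption that will differ on other systems.

```shell
# mpiexec build intended to pair with an mpich ch_p4 build that was
# configured using --with-comm=shared.
# The PBS path is copied from the quoted message; adjust it for your site.
./configure --enable-p4-shmem --with-default-comm=mpich-p4 \
    --with-pbs=/usr/local/encap/torque-1.1.0p0
```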
> There's another aspect and an interesting observation, when running the
> 'JAC' benchmark with pmemd (its source modified according to
> http://structbio.vanderbilt.edu/archives/amber-archive/2004/1080.phtml):
[See the PS below.]
> Jobs submitted via: qsub .. -l nodes=<nodes>:ppn=2 <command>
>
> where: <nodes> = No. processors <np> / 2,
> <command> is ..
> .. either:
>
> <path>/mpirun -np <np> -nolocal \
> <my-path>/pmemd <pmemd-input>
>
> .. or:
>
> <path>/mpiexec -kill -nostdin -nostdout \
> <my-path>/pmemd <pmemd-input>
>
> mpirun (as far as I could see from the PI... files created while the job
> is running) does not seem to 'care' about the $PBS_NODEFILE provided by
> PBS: i.e. in our environment, PBS, no matter if '-l nodes=..:ppn=2' or
> '..:ppn=1' (!), "offers" a list of nodes with 2 processors per node, but
> mpirun "chooses", if possible, ONE processor per NODE with the right
> total number of processors,
> e.g. if submitted with '-l nodes=8:ppn=2': uses 16 nodes, with 1
> processor per node.
Yep, that's another good reason not to use mpirun. The machines file
you installed is all it looks at. It does not know about PBS. It walks
through that machines file, putting one process on each host until it
runs out of hosts. Then it goes back to the top of the list and puts
a process on the second processor of each node, conceptually. Again,
ignoring PBS's recommendation entirely.
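To make that walk through the machines file concrete, here is a small bash model of it. This is only an illustration of the placement order described above, not code from mpich; place_round_robin is a made-up helper and the host names are placeholders.

```shell
#!/bin/bash
# Illustrative model only: place NP processes by cycling through a list of
# hosts, one process per host per pass, wrapping to the top when the list
# runs out. place_round_robin is a hypothetical helper, not part of mpich.
place_round_robin() {
    local np=$1; shift
    local hosts=("$@")
    local i
    for ((i = 0; i < np; i++)); do
        echo "rank $i -> ${hosts[i % ${#hosts[@]}]}"
    done
}

# Four ranks over three hosts: the fourth wraps back to the first host.
place_round_robin 4 node01 node02 node03
```

With more hosts in the list than processes requested, every process lands on its own host, which is the one-per-node behaviour reported above.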
> mpiexec "follows the suggestion" of PBS (TORQUE) and uses 2 processors
> per node,
> e.g. if submitted with '-l nodes=8:ppn=2': uses 8 nodes with 2 proc. on
> each, strictly according to the contents of $PBS_NODEFILE.
>
> The 'JAC' benchmark simulates 1000 ps of molecular dynamics
> (http://amber.scripps.edu/amber8.bench1.html) for a model system.
> The following table gives the sec. of 'clean' computing time (i.e.
> without 'setup time') necessary for this benchmark:
>
> No. processors   mpirun (sec.)   mpiexec (sec.)
>
>        2              282             298
>        4              147             156
>        8               77              82
>       16               41              46
>       32               28              27
>
> From this, one might conclude (?), that, since 2 processors on each
> node "compete" for shared memory, as in the case with mpiexec, comp.
> time's going up.
With your 24 node cluster, just looking at the 8-process run for
example, mpirun will put one process on each of node01..node08. In the
mpiexec case, if you ask PBS for "-l nodes=4:ppn=2", you will get two
processes on each of node01..node04 (or some range of four nodes).
What you are seeing with the longer runtime in the mpiexec case is the
two processes fighting for the shared resources on the single node: the
memory bus is a frequent bottleneck, as might be the single ethernet port.
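To put a number on that contention, one could compute the relative difference between the two columns of the quoted table. The sketch below does this with awk; slowdown is a hypothetical helper name, and the input rows are simply the times reported above.

```shell
# Times in seconds copied from the quoted table: processes, mpirun, mpiexec.
# Prints the process count and how much longer (in percent) the mpiexec
# run took; a negative value means the mpiexec run was faster.
slowdown() {
    awk '{ printf "%d %.1f\n", $1, ($3 - $2) / $2 * 100 }'
}

slowdown <<'EOF'
2 282 298
4 147 156
8 77 82
16 41 46
32 28 27
EOF
```

The gap is in the single to low double digits of percent up to 16 processes and vanishes at 32. My guess, not something measured here, is that 32 processes cannot be spread one per node on 24 nodes, so mpirun ends up sharing nodes too.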
You can try an mpiexec run with "-l nodes=8:ppn=2" but then say
    mpiexec -pernode -np 8 amber
to get only one process on each of 8 nodes.
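Before launching, it can help to check what PBS actually granted. The following is a minimal sketch, assuming the usual PBS/TORQUE convention that $PBS_NODEFILE lists one line per allocated processor; slots_per_host is a hypothetical helper name.

```shell
# slots_per_host is a hypothetical helper: counts how many times each host
# appears in a PBS-style node file (one line per allocated processor) and
# prints "host count" pairs.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job submitted with "-l nodes=8:ppn=2" one might then run:
#   slots_per_host "$PBS_NODEFILE"
#   mpiexec -pernode -np 8 amber
```

If every host shows a count of 2, the job holds both processors on each node, and "-pernode" leaves the second one idle so the single process does not share the memory bus.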
-- Pete
P.S. Regarding the archived mail in
http://structbio.vanderbilt.edu/archives/amber-archive/2004/1080.phtml
We solve that mpich bug in a different way, attached. Should work for
you with the intel fortran compiler.
-------------- next part --------------
diff -ruN gridftp/src/fortran/src/initf.c pw/src/fortran/src/initf.c
--- gridftp/src/fortran/src/initf.c 2001-12-12 18:36:43.000000000 -0500
+++ pw/src/fortran/src/initf.c 2004-09-03 17:13:23.000000000 -0400
@@ -117,6 +117,40 @@
void mpir_getarg_ ( MPI_Fint *, char *, MPI_Fint );
#endif
+#if defined(__PGI)
+
+FORTRAN_API void FORT_CALL mpi_init_(MPI_Fint *ierr)
+{
+ extern int __argc_save;
+ extern char **__argv_save;
+
+ *ierr = MPI_Init(&__argc_save, &__argv_save);
+}
+
+#elif defined(__ICC) && (__ICC >= 800)
+
+FORTRAN_API void FORT_CALL mpi_init_(MPI_Fint *ierr)
+{
+ extern int for__l_argc;
+ extern char **for__a_argv;
+
+ *ierr = MPI_Init(&for__l_argc, &for__a_argv);
+}
+
+#elif defined(__ICC)
+
+/* old versions of intel compiler */
+FORTRAN_API void FORT_CALL mpi_init_(MPI_Fint *ierr)
+{
+ extern int xargc;
+ extern char **xargv;
+
+ *ierr = MPI_Init(&xargc, &xargv);
+}
+
+#else
+/* unknown compiler, just copy argc/argv */
+
FORTRAN_API void FORT_CALL mpi_init_( MPI_Fint *ierr )
{
int Argc;
@@ -185,4 +219,5 @@
must initialize all languages */
}
+#endif /* unknown compiler, copying argc/argv */
#endif