ch_p4 and enable-p4-shmem

Pete Wyckoff pw at osc.edu
Wed Feb 22 15:47:13 EST 2006


PBisbal at LexPharma.com wrote on Wed, 22 Feb 2006 14:38 -0500:
> I'm using mpich 1.2.7p1, which I configured thusly:
> 
> ./configure --prefix=/usr/local/mpich-1.2.7p1 --enable-sharedlib --with-device=ch_p4 --with-comm=shared
> 
> I compiled mpiexec 0.80 with these options:
> 
> ./configure --with-prefix=/usr/local/mpiexec-0.80 --with-pbs=/usr/local --with-default-comm=mpich-p4 --enable-p4-shmem 

A perfectly compatible configuration.  Some flags you don't need,
but no harm.

> When I run runtests.pl with 
> 
> $available_nodes = 4;
> $smpsize = 1;
> 
> The test script doesn't encounter any errors. 

If you have dual-processor nodes, and your PBS is configured to know
that, this $smpsize=1 setting means you get "-l nodes=4:ppn=1" jobs.

> When I change $smpsize=2, I get errors like this:

Now you are getting "-l nodes=4:ppn=2" jobs.
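
As a rough illustration of how those two settings turn into the PBS
resource request (a sketch only, not the actual code in runtests.pl;
the function name is made up):

```shell
# Hypothetical helper: build the PBS resource request from the two
# runtests.pl settings.  Not taken from runtests.pl itself.
pbs_request() {
    available_nodes=$1
    smpsize=$2
    echo "-l nodes=${available_nodes}:ppn=${smpsize}"
}

pbs_request 4 1   # the setting that ran cleanly
pbs_request 4 2   # the setting that produces the errors
```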

>  ./runtests.pl 
> Testing 4 nodes with SMP size 2.
> 2533 to testqo.5261.01 mpiexec -n 1 hello ...
> 2534 to testqo.5261.02 mpiexec -n 2 hello ..

This test "-n 2" put two tasks on one compute node.  It worked!
Meaning your mpich and mpiexec configure settings were just fine.

> 2535 to testqo.5261.03 mpiexec -n 3 hello ...
> 2536 to testqo.5261.04 mpiexec -n 8 hello ...........................
> File testho.5261.04: unexpected line: hello: hw-underdog.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: p0_5499:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-optimus.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_29795:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-appsrv05.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_17122:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: cannot execute binary file
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: Exec format error
> File testho.5261.04: unexpected line: p0_5499: (24.068400) net_send: could not write to fd=4, errno = 32
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 0,3 exited with status 1.
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 1-2 exited with status 139.

Bizarre.  Because mpich-p4 itself spawns the second task on each
SMP node, mpiexec only starts 4 tasks, on 4 nodes, to create 8 total
processes in the parallel job.  Looks like three of those tasks gave
up waiting for MPI_Init() to finish and died, and some of the
associated mpich-p4 processes took a SEGV along the way.  One,
though, found something wrong with the "hello" executable itself.
That should make you worry.
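
The two exit statuses in that log can be read with a small sketch like
the one below.  It assumes mpiexec reports a death by signal using the
usual shell convention of 128 plus the signal number, which fits the
"SIGSEGV: 11" lines in the same output.  The function name is made up
for illustration:

```shell
# Hypothetical helper: interpret an exit status, assuming the
# 128 + signal-number convention for processes killed by a signal.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with code $status"
    fi
}

decode_status 139   # tasks 1-2 in the log
decode_status 1     # tasks 0,3 in the log
```

Read that way, status 139 is signal 11 (SIGSEGV), and status 1 is an
ordinary failure exit.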

> Any ideas what is wrong? I've been googling for days, but haven't turned up any meaningful answers. Is this a problem with mpiexec or mpich? 

Something is up with your shared file system.  Do all compute nodes
mount /scratch.d/hw-underdog/ over NFS in the same way?  Any weird
NFS settings?  If you get this again, log on to each of the four
nodes in the job and take a look at the "hello" executable:

    ls -la hello
    file hello
    ldd hello

Don't forget your mpich shared libraries in
/usr/local/mpich-1.2.7p1/lib/shared.  Those all have to be present
and identical on all systems too.
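
One way to automate the "identical on all systems" check is to collect
a checksum from every node and compare them.  The sketch below is only
a starting point: the function name is made up, and the commented
usage assumes password-less ssh between nodes, md5sum on every node,
and that it runs inside the PBS job so $PBS_NODEFILE is set.

```shell
# Hypothetical helper: read "<node> <checksum>" lines on stdin and
# report whether every node saw the same checksum as the first one.
check_identical() {
    awk '
        NR == 1   { ref = $2 }
        $2 != ref { bad = bad " " $1 }
        END {
            if (NR == 0)   { print "no input"; exit 2 }
            if (bad != "") { print "MISMATCH:" bad; exit 1 }
            print "OK: " NR " nodes agree"
        }
    '
}

# Possible usage from inside the job (assumptions as stated above):
#
#   f=/scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello
#   for n in $(sort -u "$PBS_NODEFILE"); do
#       printf '%s %s\n' "$n" "$(ssh "$n" md5sum "$f" | cut -d' ' -f1)"
#   done | check_identical
```

The same loop can be repeated for each file under
/usr/local/mpich-1.2.7p1/lib/shared.  A node listed after "MISMATCH:"
is the one to log on to and inspect by hand.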

		-- Pete

