ch_p4 and enable-p4-shmem
Pete Wyckoff
pw at osc.edu
Wed Feb 22 15:47:13 EST 2006
PBisbal at LexPharma.com wrote on Wed, 22 Feb 2006 14:38 -0500:
> I'm using mpich 1.2.7p1, which I configured thusly:
>
> ./configure --prefix=/usr/local/mpich-1.2.7p1 --enable-sharedlib --with-device=ch_p4 --with-comm=shared
>
> I compiled mpiexec 0.80 with these options:
>
> ./configure --with-prefix=/usr/local/mpiexec-0.80 --with-pbs=/usr/local --with-default-comm=mpich-p4 --enable-p4-shmem
A perfectly compatible configuration. Some flags you don't need,
but no harm.
> When I run runtests.pl with
>
> $available_nodes = 4;
> $smpsize = 1;
>
> The test script doesn't encounter any errors.
If you have dual-processor nodes, and your PBS is configured to know
that, this $smpsize=1 setting means you get "-l nodes=4:ppn=1" jobs.
> When I change $smpsize=2, I get errors like this:
Now you are getting "-l nodes=4:ppn=2" jobs.
> ./runtests.pl
> Testing 4 nodes with SMP size 2.
> 2533 to testqo.5261.01 mpiexec -n 1 hello ...
> 2534 to testqo.5261.02 mpiexec -n 2 hello ..
This "-n 2" test put two tasks on one compute node, and it worked.
That means your mpich and mpiexec configure settings are fine.
> 2535 to testqo.5261.03 mpiexec -n 3 hello ...
> 2536 to testqo.5261.04 mpiexec -n 8 hello ...........................
> File testho.5261.04: unexpected line: hello: hw-underdog.lexpharma.com MPI_Init did not finish
> File testho.5261.04: unexpected line: p0_5499: p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-optimus.lexpharma.com MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_29795: p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-appsrv05.lexpharma.com MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_17122: p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: cannot execute binary file
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: Exec format error
> File testho.5261.04: unexpected line: p0_5499: (24.068400) net_send: could not write to fd=4, errno = 32
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 0,3 exited with status 1.
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 1-2 exited with status 139.
Bizarre. Because mpich-p4 handles spawning the second task on each
SMP node, mpiexec only starts 4 tasks, on 4 nodes, to create 8 total
processes in the parallel job. It looks like three of them got bored
waiting for MPI_Init() to finish and died, with a SEGV in the
associated mpich-p4 processes (p0_5499, rm_29795, rm_17122). One,
though, found something wrong with the "hello" executable itself:
bash reported "cannot execute binary file" and "Exec format error".
That should make you worry.
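A note on reading the log: status 139 is most likely 128 + 11, the
usual shell convention for a process killed by signal 11 (SIGSEGV),
which would match the p4_error lines. A small sketch for decoding
such a status follows. decode_status is a hypothetical helper name,
and that mpiexec reports statuses this way is an assumption.

```shell
# decode_status: hypothetical helper, not part of mpiexec or mpich.
# Interprets an exit status using the common shell convention that
# values above 128 mean "killed by signal (status - 128)".
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name for that number
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

decode_status 139
decode_status 1
```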
> Any ideas what is wrong? I've been googling for days, but haven't turned up any meaningful answers. Is this a problem with mpiexec or mpich?
Something is up with your shared file system. Do all compute nodes
mount /scratch.d/hw-underdog/ over NFS identically? Any weird NFS
settings? If you get this again, log onto each of the four nodes
in the job and take a look at the "hello" executable:
    ls -la hello
    file hello
    ldd hello
Don't forget your mpich shared libraries in
/usr/local/mpich-1.2.7p1/lib/shared. Those all have to be present
and identical on all systems too.
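If checking every node by hand gets tedious, the "identical on all
systems" part can be scripted. This is only a minimal sketch:
check_identical is a made-up helper that reads "node checksum" lines
and flags any node whose checksum differs from the first one. The
ssh loop in the comment assumes you can ssh to the compute nodes
and that md5sum is installed there.

```shell
# check_identical: hypothetical helper, not part of mpiexec or mpich.
# Reads lines of "<node> <checksum>" on stdin, treats the first line
# as the reference, prints every node whose checksum differs, and
# returns 1 if any mismatch was found (0 otherwise).
check_identical() {
    awk '
        NR == 1   { ref = $2; next }
        $2 != ref { print "MISMATCH on " $1; bad = 1 }
        END       { exit bad + 0 }
    '
}

# Possible use inside a PBS job (assumes ssh access and md5sum):
#   f=/scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello
#   for n in $(sort -u "$PBS_NODEFILE"); do
#       printf '%s %s\n' "$n" "$(ssh "$n" md5sum "$f" | cut -d' ' -f1)"
#   done | check_identical
```

The same loop can be pointed at each library under
/usr/local/mpich-1.2.7p1/lib/shared in turn.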
-- Pete