ch_p4 and enable-p4-shmem

Bisbal, Prentice PBisbal at LexPharma.com
Wed Feb 22 16:55:00 EST 2006


Pete, 

Thanks for the helpful information. It was very reassuring to know that
mpiexec and mpich were working correctly. Eliminating them from the list
of possible sources made it easy for me to find the source of the
problem: me! 

I forgot that mpiexec uses pbs to determine which hosts to run on. For
some reason, I was still thinking it was using
/usr/local/mpich/share/machines.LINUX. I have several SGI systems
running IRIX, and they obviously can't run binaries for Linux on
x86. Once I removed my SGI hosts from the pbs config, all tests ran
fine. 
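
For anyone hitting the same thing, one way to see which hosts PBS actually handed to a job is to summarize the job's node file. This is a minimal sketch, assuming a PBS/Torque-style node file (the path given in $PBS_NODEFILE, one hostname per line, repeated once per allocated processor). The function name slots_per_host is made up for illustration and is not part of mpiexec or PBS.

```python
from collections import Counter


def slots_per_host(nodefile_text):
    """Count how many processor slots each host has in a PBS node file.

    Assumes one hostname per line, repeated once per allocated
    processor. Blank lines and surrounding whitespace are ignored.
    """
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return dict(Counter(hosts))
```

Inside a job, feeding it the contents of the file named by $PBS_NODEFILE would show at a glance whether an unexpected host (an IRIX box, say) is in the allocation.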

Thanks again for the help.

Prentice Bisbal
Unix Administrator
Lexicon Pharmaceuticals
350 Carter Rd
Princeton, NJ 08540
609-466-5578
pbisbal at lexpharma.com

-----Original Message-----
From: Pete Wyckoff [mailto:pw at osc.edu] 
Sent: Wednesday, February 22, 2006 3:47 PM
To: Bisbal, Prentice
Cc: mpiexec at osc.edu
Subject: Re: ch_p4 and enable-p4-shmem

PBisbal at LexPharma.com wrote on Wed, 22 Feb 2006 14:38 -0500:
> I'm using mpich 1.2.7p1, which I configured thusly:
> 
> ./configure --prefix=/usr/local/mpich-1.2.7p1 --enable-sharedlib 
> --with-device=ch_p4 --with-comm=shared
> 
> I compiled mpiexec 0.80 with these options:
> 
> ./configure --with-prefix=/usr/local/mpiexec-0.80 
> --with-pbs=/usr/local --with-default-comm=mpich-p4 --enable-p4-shmem

A perfectly compatible configuration.  Some flags you don't need, but no
harm.

> When I run runtests.pl with
> 
> $available_nodes = 4;
> $smpsize = 1;
> 
> The test script doesn't encounter any errors. 

If you have dual-processor nodes, and your PBS is configured to know
that, this $smpsize=1 setting means you get "-l nodes=4:ppn=1" jobs.

> When I change $smpsize=2, I get errors like this:

Now you are getting "-l nodes=4:ppn=2" jobs.
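
The mapping from the two test-script settings to the PBS resource request can be sketched as follows. This is only an illustration of the relationship described here, not the actual code in runtests.pl, and the function name pbs_nodes_request is hypothetical.

```python
def pbs_nodes_request(available_nodes, smpsize):
    """Build the '-l nodes=N:ppn=M' request implied by the two settings.

    available_nodes corresponds to $available_nodes and smpsize to
    $smpsize in the test script; both must be positive integers.
    """
    if available_nodes < 1 or smpsize < 1:
        raise ValueError("both values must be positive integers")
    return "-l nodes=%d:ppn=%d" % (available_nodes, smpsize)
```

With 4 nodes, smpsize 1 gives "-l nodes=4:ppn=1" and smpsize 2 gives "-l nodes=4:ppn=2", so the second case asks for 8 processor slots in total.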

>  ./runtests.pl
> Testing 4 nodes with SMP size 2.
> 2533 to testqo.5261.01 mpiexec -n 1 hello ...
> 2534 to testqo.5261.02 mpiexec -n 2 hello ..

This test "-n 2" put two tasks on one compute node.  It worked!
Meaning your mpich and mpiexec configure settings were just fine.

> 2535 to testqo.5261.03 mpiexec -n 3 hello ...
> 2536 to testqo.5261.04 mpiexec -n 8 hello ...........................
> File testho.5261.04: unexpected line: hello: hw-underdog.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: p0_5499:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-optimus.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_29795:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: hello: hw-appsrv05.lexpharma.com  MPI_Init did not finish
> File testho.5261.04: unexpected line: rm_17122:  p4_error: interrupt SIGSEGV: 11
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: cannot execute binary file
> File testho.5261.04: unexpected line: /bin/bash: /scratch.d/hw-underdog/pbisbal/mpiexec-0.80/hello: Exec format error
> File testho.5261.04: unexpected line: p0_5499: (24.068400) net_send: could not write to fd=4, errno = 32
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 0,3 exited with status 1.
> File testho.5261.04: unexpected line: mpiexec: Warning: tasks 1-2 exited with status 139.

Bizarre.  Because mpich-p4 handles spawning the second task on each SMP
node, mpiexec only starts 4 tasks, on 4 nodes, to create 8 total
processes in the parallel job.  Looks like three of them got bored
waiting for MPI_Init() to finish and died, although some SEGV happened
inside some of the associated mpich-p4 processes when that happened.
One, though, found something wrong with the "hello"
executable itself.  That should make you worry.
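
The numbers in the quoted output can be decoded mechanically. The sketch below assumes the common shell convention that an exit status above 128 means the process was killed by signal (status - 128), and it looks up errno names for the platform it runs on. The helper names are illustrative only.

```python
import errno
import signal


def describe_exit_status(status):
    """Interpret an exit status using the shell convention that
    values above 128 mean 'killed by signal (status - 128)'."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "signal %d" % signum
        return "killed by %s" % name
    return "exited with code %d" % status


def describe_errno(code):
    """Map a numeric errno to its symbolic name on this platform."""
    return errno.errorcode.get(code, "unknown errno %d" % code)
```

On Linux, status 139 decodes to SIGSEGV (128 + 11), which matches the p4_error lines, and errno 32 is EPIPE, a write to a connection whose other end has gone away.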

> Any ideas what is wrong? I've been googling for days, but haven't
> turned up any meaningful answers. Is this a problem with mpiexec or
> mpich?

Something is up with your shared file system.  Do all compute nodes
mount /scratch.d/hw-underdog/ as NFS identically?  Any weird NFS
settings?  If you get this again, log onto each of the four nodes in the
job and take a look at the "hello" executable:

    ls -la hello
    file hello
    ldd hello

Don't forget your mpich shared libraries in
/usr/local/mpich-1.2.7p1/lib/shared.  Those all have to be present and
identical on all systems too.
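
An "Exec format error" usually means the binary was built for a different architecture than the host trying to run it, which is what "file hello" reveals. As a rough illustration of what that check looks at, here is a sketch that reads the start of an ELF header. It covers only a handful of machine types, and the function names are hypothetical; it is not a replacement for file(1).

```python
import struct

# A small subset of e_machine values from the ELF specification.
ELF_MACHINES = {3: "Intel 80386", 8: "MIPS", 62: "x86-64"}


def describe_elf(header):
    """Summarize the first 20 bytes of a file, roughly as `file` would."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not an ELF file"
    # EI_CLASS (byte 4): 1 = 32-bit, 2 = 64-bit
    bits = {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown class")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian
    if header[5] == 1:
        order, fmt = "LSB", "<H"
    elif header[5] == 2:
        order, fmt = "MSB", ">H"
    else:
        return "ELF with unknown byte order"
    # e_machine is a 16-bit field at offset 18, in the file's byte order.
    machine = struct.unpack_from(fmt, header, 18)[0]
    name = ELF_MACHINES.get(machine, "machine %d" % machine)
    return "ELF %s %s, %s" % (bits, order, name)


def describe_file(path):
    """Read just enough of the file at `path` to describe it."""
    with open(path, "rb") as f:
        return describe_elf(f.read(20))
```

Running describe_file("hello") on each node in the job should give the same answer everywhere; a result that does not match the node's own architecture points at the kind of mismatch bash is complaining about.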

		-- Pete


