MPIEXEC Problems

Pete Wyckoff pw at osc.edu
Tue Jul 8 09:18:08 EDT 2003


James_ODell at Brown.edu said on Mon, 07 Jul 2003 17:25 -0400:
> [..]

Dang, ssh-based mpirun works fine.

> When I try it interactively it works for 1 node and ppn=1
> 
> when I try with ppn=2 I get the following:
> 
> [odell at lou odell]$ qsub -l nodes=1:ppn=2 -I
> qsub: waiting for job 2024.lou.cascv.brown.edu to start
> qsub: job 2024.lou.cascv.brown.edu ready
>
> [odell at compute-0-0 odell]$ mpiexec hello
> mpiexec: Error: wait_one_task_start: tm_poll remote: System error.
> [odell at compute-0-0 odell]$ hello from 0/1 hostname compute-0-0 with 0
> args:
> 
> ==========================
> The funny thing is I have an almost identical cluster where MPIEXEC
> works flawlessly.

Since this does smell like either an mpiexec or PBS problem, I recompiled
mpich without shmem to try against a similarly configured mpiexec here,
hoping to find a bug.

I don't see anything wrong, and the configure arguments you used look
fine too.  For completeness, here's what I used for mpich:

    prefix=/usr/local/mpich-1.2.5-1a-p4-no-shmem
    pvfs=/usr/local/pvfs
    CC=pgcc  CFLAGS="-O3"  CXX=pgCC  CXXFLAGS=$CFLAGS \
    FC=pgf77 FFLAGS="-O3"  F90=pgf90 F90FLAGS=$FFLAGS \
    ./configure --prefix=$prefix \
        --with-arch=LINUX --with-device=ch_p4:--socksize=65536 \
        -with-romio="-file_system=pvfs+nfs+ufs -cflags=-I$pvfs/include" \
        -lib="-L$pvfs/lib -lpvfs"

And for mpiexec:

    ./configure --disable-p4-shmem --with-default-comm=p4
    ./mpiexec -np 2 hello

If I forget the "--disable-p4-shmem" flag, all sorts of p4 errors
happen, but no PBS/TM system errors.

I'm fairly convinced there's something unhappy with PBS, but can't
figure out what it is.  You can use the "Big Hammer" of strace, though,
if you want to get a bit messy.  Pick an idle node, log in as root, and
attach strace to its pbs_mom:

    strace -vFf -s 400 -o /tmp/strace.out -p $(pgrep pbs_mom)

Then elsewhere, run your "hello" batch job that uses that node:

    qsub -l nodes=compute-0-0:ppn=2 myjob

Finally, kill strace (on broken Red Hat Linux 8.0 systems you may also
need kill -CONT $(pgrep pbs_mom) to get the daemon running again), then
examine the output file for anomalies.  Grep for some of the error
messages you see in the pbs_mom log.  And/or mail it to me, along with
the relevant snippet from the pbs_mom log.
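
As a first pass over the trace, something like the following might help.
It is only a sketch: the two function names are made up for illustration
and are not part of strace or PBS.  It relies only on strace printing a
failed system call as "= -1 ERRNAME (description)".

```shell
# Hypothetical helpers (names invented for this example).

# Print every failed system call in a saved trace, with line numbers.
strace_failures() {
    grep -n -e '= -1 E' "$1"
}

# Count failures per errno name, most frequent first.
strace_errno_summary() {
    sed -n 's/.*= -1 \(E[A-Z0-9]*\).*/\1/p' "$1" | sort | uniq -c | sort -rn
}
```

For example, "strace_errno_summary /tmp/strace.out" gives a quick
overview, and "strace_failures /tmp/strace.out | less" shows the calls
themselves.  Expect plenty of harmless entries such as EAGAIN or EINTR
from a daemon; the interesting ones are usually those close in time to
the tm_poll error.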

		-- Pete


