mpiexec & PBS Professional 7.1: "PBS reports fewer hosts than TM"

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon Apr 3 06:05:07 EDT 2006


Hello Pete,

On Fri, Mar 31, 2006 at 04:01:46PM -0500, Pete Wyckoff wrote:
> thomas.zeiser at rrze.uni-erlangen.de wrote on Fri, 31 Mar 2006 21:29 +0200:
> > since upgrading from PBS Professional 7.0 to 7.1 we get the
> > following error message when starting jobs with mpiexec
> > 
> >   /opt/mpiexec-0.80/bin/mpiexec -n 2 -comm none hostname
> >   mpiexec: Error: get_hosts: PBS reports fewer hosts 1 than TM 2.
> 
> This is something we were just tracking down for someone else.
> They changed the meaning of the entries in nodelist[] returned
> by tm_nodeinfo().  No longer is it "nodes", but rather "CPUs".
> Rather annoying to switch it on us like this.
> 
> Can you try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz ?
> It was tested on a PBSPro cluster environment, but not on an SMP
> like yours.  If it doesn't work, please walk through the while
> loop in get_hosts() (in get_hosts.c) and see if you can spot what
> is going on.  If you say it's fine maybe I'll just spin a release
> soon in case anyone else is testing.

This new version works fine with PBS Pro 7.1 on our SMP machine! Thanks.

> There's another new PBSpro-only feature you may want to take a look
> at if you do not have standard IO redirection working, i.e "mpiexec
> --comm=none hostname > /dev/null" should produce no output.  Try to
> ./configure "--enable-pbspro-helper" sometime, but only after you
> get the above problem fixed.

However, "--enable-pbspro-helper" does not work yet:

altix% mpiexec -comm=none hostname
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec: Warning: tasks 0-3 exited with status 1.  

The corresponding output from qstat -f is the following:
Job Id: 74856.altix
    Job_Name = STDIN
    Job_Owner = unrz143 at altix
    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 28608kb
    resources_used.ncpus = 4
    resources_used.vmem = 55232kb
    resources_used.walltime = 00:02:48
    job_state = R
    queue = parallel
    server = altix
    Checkpoint = u
    ctime = Mon Apr  3 11:24:12 2006
    Error_Path = /dev/pts/3
    exec_host = altix:ssinodes=2:mem=7974912kb:ncpus=4
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Apr  3 11:24:21 2006
    Output_Path = /dev/pts/3
    Priority = 0
    qtime = Mon Apr  3 11:24:12 2006
    Rerunable = True
    Resource_List.ncpus = 4
    Resource_List.nice = 4
    Resource_List.place = pack:group=host
    Resource_List.select = 1:ncpus=4
    Resource_List.walltime = 00:30:00
    stime = 1144056261
    session_id = 12025
    Variable_List = PBS_O_HOME=/home//unrz/unrz143,
        PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=unrz143,
        PBS_O_PATH=/opt/kde3/bin:/opt/gnome/bin:/usr/games:/home/unr
        z/unrz143/bin:/usr/bin/X11:/usr/bin:/bin:/usr/sbin:/sbin:/usr/lib/java/
        jre/bin:/opt/rrze/bin:/usr/pbs/bin,PBS_O_MAIL=/mail//unrz143,
        PBS_O_SHELL=/bin/tcsh,PBS_O_HOST=altix.uni-erlangen.de,
        PBS_O_WORKDIR=/home/cluster64/unrz/unrz143/Install-Cluster64/mpiexec-0
        .81pre3,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=router
    comment = Job run at Mon Apr 03 at 11:24 on altix:ssinodes=2:mem=7974912kb:
        ncpus=4
    alt_id = cpuset=/PBSPro/unrz14374856.altix
    etime = Mon Apr  3 11:24:12 2006
    accounting_id = 0xa8c0bf5000000015

mpiexec obviously gets confused by the node properties added to
exec_host.
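
To illustrate what I mean, here is a minimal sketch (in Python, not the
actual C code in mpiexec) of how the host name could be separated from the
trailing properties before it is passed to the resolver. The function name
is hypothetical, and the grammar it assumes ("+"-joined chunks, each starting
with the host name followed by "/index" or ":resource=value" parts) is only
inferred from the single exec_host sample above and the older
"host/cpu+host/cpu" style; PBS Pro may well produce other forms.

```python
def hosts_from_exec_host(exec_host):
    """Return bare host names from a PBS exec_host string.

    Assumed grammar (inferred, not taken from PBS documentation):
    chunks joined by "+", each chunk starting with the host name,
    followed by "/index" or ":resource=value" parts.
    """
    hosts = []
    for chunk in exec_host.split("+"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Cut at the first "/" or ":" so that only the host name is left.
        for sep in ("/", ":"):
            chunk = chunk.split(sep, 1)[0]
        hosts.append(chunk)
    return hosts


print(hosts_from_exec_host("altix:ssinodes=2:mem=7974912kb:ncpus=4"))
```

With the exec_host value from the qstat output above this yields just
"altix", which is what the resolver should be given.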


Another point regarding mpiexec-redir-helper: especially when
testing new versions of mpiexec, it would be very helpful if
mpiexec-redir-helper were not looked up in $PATH but simply taken
from the directory where mpiexec itself resides (or if a command
line option could specify the location of mpiexec-redir-helper).
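
The lookup order I have in mind could be sketched roughly as follows (again
in Python only for brevity; mpiexec itself is C). The function name
find_helper and the "override" parameter, standing in for a possible command
line option, are hypothetical and do not exist in mpiexec today.

```python
import os
import shutil


def find_helper(argv0, name="mpiexec-redir-helper", override=None):
    """Locate the helper: explicit override, then argv0's directory, then $PATH."""
    if override:
        # Hypothetical command line option: trust the user-supplied path.
        return override
    # Work out where mpiexec itself lives; a bare name needs a $PATH lookup.
    exe = argv0 if os.sep in argv0 else (shutil.which(argv0) or argv0)
    candidate = os.path.join(os.path.dirname(os.path.realpath(exe)), name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Fall back to the current behaviour: search $PATH (None if not found).
    return shutil.which(name)
```

This way a freshly built mpiexec in a test directory would pick up the
helper built next to it, while an installed copy keeps working as before.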

Kind regards,

thomas
-- 
Dipl.-Ing. Thomas ZEISER
Regionales Rechenzentrum Erlangen
Martensstr. 1, 91058 Erlangen, GERMANY

