mpiexec & PBS Professional 7.1: "PBS reports fewer hosts than TM"
Thomas Zeiser
thomas.zeiser at rrze.uni-erlangen.de
Mon Apr 3 06:05:07 EDT 2006
Hello Pete,
On Fri, Mar 31, 2006 at 04:01:46PM -0500, Pete Wyckoff wrote:
> thomas.zeiser at rrze.uni-erlangen.de wrote on Fri, 31 Mar 2006 21:29 +0200:
> > since upgrading from PBS Professional 7.0 to 7.1 we get the
> > following error message when starting jobs with mpiexec
> >
> > /opt/mpiexec-0.80/bin/mpiexec -n 2 -comm none hostname
> > mpiexec: Error: get_hosts: PBS reports fewer hosts 1 than TM 2.
>
> This is something we were just tracking down for someone else.
> They changed the meaning of the entries in nodelist[] returned
> by tm_nodeinfo(). No longer is it "nodes", but rather "CPUs".
> Rather annoying to switch it on us like this.
>
> Can you try http://www.osc.edu/~pw/mpiexec/mpiexec-0.81-pre3.tgz ?
> It was tested on a PBSPro cluster environment, but not on an SMP
> like yours. If it doesn't work, please walk through the while
> loop in get_hosts() (in get_hosts.c) and see if you can spot what
> is going on. If you say it's fine maybe I'll just spin a release
> soon in case anyone else is testing.
This new version works fine with PBS Pro 7.1 on our SMP machine! Thanks.
> There's another new PBSpro-only feature you may want to take a look
> at if you do not have standard IO redirection working, i.e "mpiexec
> --comm=none hostname > /dev/null" should produce no output. Try to
> ./configure "--enable-pbspro-helper" sometime, but only after you
> get the above problem fixed.
However, "--enable-pbspro-helper" does not work yet:
altix% mpiexec -comm=none hostname
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec-redir-helper: Error: could not resolve "altix:ssinodes=2:mem=7974912kb".
mpiexec: Warning: tasks 0-3 exited with status 1.
The corresponding output from qstat -f is the following:
Job Id: 74856.altix
Job_Name = STDIN
Job_Owner = unrz143 at altix
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 28608kb
resources_used.ncpus = 4
resources_used.vmem = 55232kb
resources_used.walltime = 00:02:48
job_state = R
queue = parallel
server = altix
Checkpoint = u
ctime = Mon Apr 3 11:24:12 2006
Error_Path = /dev/pts/3
exec_host = altix:ssinodes=2:mem=7974912kb:ncpus=4
Hold_Types = n
interactive = True
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Apr 3 11:24:21 2006
Output_Path = /dev/pts/3
Priority = 0
qtime = Mon Apr 3 11:24:12 2006
Rerunable = True
Resource_List.ncpus = 4
Resource_List.nice = 4
Resource_List.place = pack:group=host
Resource_List.select = 1:ncpus=4
Resource_List.walltime = 00:30:00
stime = 1144056261
session_id = 12025
Variable_List = PBS_O_HOME=/home//unrz/unrz143,
PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=unrz143,
PBS_O_PATH=/opt/kde3/bin:/opt/gnome/bin:/usr/games:/home/unr
z/unrz143/bin:/usr/bin/X11:/usr/bin:/bin:/usr/sbin:/sbin:/usr/lib/java/
jre/bin:/opt/rrze/bin:/usr/pbs/bin,PBS_O_MAIL=/mail//unrz143,
PBS_O_SHELL=/bin/tcsh,PBS_O_HOST=altix.uni-erlangen.de,
PBS_O_WORKDIR=/home/cluster64/unrz/unrz143/Install-Cluster64/mpiexec-0
.81pre3,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=router
comment = Job run at Mon Apr 03 at 11:24 on altix:ssinodes=2:mem=7974912kb:
ncpus=4
alt_id = cpuset=/PBSPro/unrz14374856.altix
etime = Mon Apr 3 11:24:12 2006
accounting_id = 0xa8c0bf5000000015
mpiexec obviously gets confused by the node properties added to
exec_host.
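To illustrate what I mean, here is a minimal sketch (in Python for brevity; mpiexec itself is C, and this is not taken from its sources) of reducing an exec_host value to bare host names before resolving them. The function names are hypothetical. Treating ":" as the start of the node properties matches the qstat output above; the handling of "+" as entry separator and "/" as CPU index suffix is an assumption based on the older "host/0+host/1" notation.

```python
def hostname_from_entry(entry):
    """Return the bare host name from a single exec_host entry."""
    # Everything after the first ':' is treated as node properties,
    # e.g. "altix:ssinodes=2:mem=7974912kb:ncpus=4" -> "altix".
    host = entry.split(":", 1)[0]
    # Assumption: older exec_host entries look like "host/0", where the
    # part after '/' is a CPU index and not part of the host name.
    host = host.split("/", 1)[0]
    return host.strip()


def hosts_from_exec_host(exec_host):
    """Return the distinct host names in exec_host, in order of appearance."""
    hosts = []
    # Assumption: multiple entries are joined with '+'.
    for entry in exec_host.split("+"):
        host = hostname_from_entry(entry)
        if host and host not in hosts:
            hosts.append(host)
    return hosts


print(hosts_from_exec_host("altix:ssinodes=2:mem=7974912kb:ncpus=4"))
# prints ['altix']
```

With the exec_host string from the job above, only "altix" would be passed to the resolver, not the whole string with the properties attached.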
Another point regarding mpiexec-redir-helper: especially for
testing new versions of mpiexec, it would be very helpful if
mpiexec-redir-helper were not looked up in $PATH but simply taken
from the directory where mpiexec itself resides (or if a command
line option could specify the location of mpiexec-redir-helper).
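The lookup order I have in mind could be sketched roughly as follows (again Python only for illustration, not mpiexec code). The function name, the "override" argument standing in for a possible command line option, and the injectable "is_executable" check are all hypothetical; no such option exists in mpiexec as far as I know.

```python
import os
import shutil

HELPER_NAME = "mpiexec-redir-helper"


def _default_is_executable(path):
    # A regular file that the current user may execute.
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_helper(mpiexec_path, override=None, is_executable=None):
    """Locate the helper: explicit override, then next to mpiexec, then $PATH."""
    if is_executable is None:
        is_executable = _default_is_executable
    # 1. A location given explicitly (hypothetical command line option) wins.
    if override:
        return override
    # 2. Look in the directory that contains the running mpiexec binary.
    bindir = os.path.dirname(os.path.abspath(mpiexec_path))
    candidate = os.path.join(bindir, HELPER_NAME)
    if is_executable(candidate):
        return candidate
    # 3. Fall back to searching $PATH; returns None if nothing is found.
    return shutil.which(HELPER_NAME)
```

That way a freshly built mpiexec in a test directory would pick up its own helper without any change to $PATH, while installed versions would keep working as they do now.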
Kind regards,
thomas
--
Dipl.-Ing. Thomas ZEISER
Regionales Rechenzentrum Erlangen
Martensstr. 1, 91058 Erlangen, GERMANY