mpich-shmem problems
Pete Wyckoff
pw at osc.edu
Wed Aug 20 11:55:40 EDT 2003
ruda at ics.muni.cz said on Wed, 20 Aug 2003 17:02 +0200:
> I have problems with mpiexec with mpich-shmem. I'm using mpiexec 0.74,
> mpich-1.2.5 and PBSPro 5.2.2 on a dual-processor Linux cluster. When using
> comm=mpich-p4 and mpich-gm, MPI jobs are started as expected. However, with
> comm=shmem (and the job compiled with mpich configured to use only shmem),
> the job is started twice (when the job is submitted to PBS using
> -l nodes=1:ppn=2): mpiexec spawns two tasks, while mpirun from mpich starts
> just one process (which later forks the second process itself).
>
> As a temporary fix, I have modified mpiexec to start only one process when
> comm=shmem is used.
>
> diff mpiexec.c.orig mpiexec.c
> 420a421,422
> > /* hack, start only one process when SHMEM is used */
> > if (cl_args->comm == COMM_SHMEM) cl_args->pernode = 1;
That certainly fixes it, but the code that is supposed to do this lives
in a big section in get_hosts.c surrounded by "if (cl_args->comm ==
COMM_SHMEM)". It tries to handle two cases:
1. Time-shared hosts seem to use "ncpus". A snippet of the server_priv
nodes file is:
coe3:ts np=24
coe4:ts np=24
Running a job:
coe3$ qsub -I -l ncpus=2 -l walltime=2:00:00
qsub: waiting for job 4122.coe3 to start
qsub: job 4122.coe3 ready
coe4$ qstat -f $PBS_JOBID | fgrep Resource_List
Resource_List.cput = 01:00:00
Resource_List.ncpus = 2
Resource_List.vmem = 1gb
Resource_List.walltime = 02:00:00
2. Non-time-shared hosts seem to use "nodect". Nodes file might
look like:
mck026 np=2
mck027 np=2
Running a job:
mck-login1$ qsub -I -l nodes=1:ppn=2
qsub: waiting for job 31963.nfs1.osc.edu to start
qsub: job 31963.nfs1.osc.edu ready
mck027$ qstat -f $PBS_JOBID | fgrep Resource_List
Resource_List.neednodes = mck027:ppn=2
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=2
Resource_List.walltime = 01:00:00
You seem to fall in case (2), as do most cluster-type installations.
Does your nodes file look the same, and do you show similar output for
Resource_List variables? There could be some PBSPro differences that
I do not know about.
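To make the two cases concrete, here is a rough sketch in C of the decision
described above. It is not the code in get_hosts.c: the names
struct job_resources, parse_ppn and shmem_numprocs are made up for
illustration, and the assumption that a time-shared job reports ncpus but no
nodect is taken only from the qstat output shown above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical summary of the Resource_List values printed by qstat -f.
 * None of these names come from the mpiexec source. */
struct job_resources {
    int ncpus;          /* Resource_List.ncpus, 0 if not present */
    int nodect;         /* Resource_List.nodect, 0 if not present */
    const char *nodes;  /* Resource_List.nodes, e.g. "1:ppn=2", or NULL */
};

/* Extract N from a "ppn=N" term in a nodes spec; default to 1. */
static int parse_ppn(const char *nodes)
{
    const char *p;
    int n;

    if (nodes == NULL)
        return 1;
    p = strstr(nodes, "ppn=");
    if (p == NULL)
        return 1;
    n = atoi(p + 4);
    return n > 0 ? n : 1;
}

/*
 * Number of processes the single spawned shmem task should fork into,
 * or -1 if the request cannot be run on one shared-memory host.
 */
static int shmem_numprocs(const struct job_resources *r)
{
    if (r->nodect == 1)
        return parse_ppn(r->nodes);   /* case 2: cluster node, use ppn */
    if (r->nodect == 0 && r->ncpus > 0)
        return r->ncpus;              /* case 1: time-shared host, use ncpus */
    return -1;                        /* more than one node, or nothing usable */
}
```

With the values from the mck027 job (nodect = 1, nodes = 1:ppn=2) this
sketch yields 2, meaning one spawned task that runs two processes. If PBSPro
fills in these fields differently, the decision could come out differently,
which is why the Resource_List output matters here.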
Also, can you add "-v" to the mpiexec command line? Do you get output
similar to this?
mck027$ mpiexec -v --comm=shmem hello
resolve_exe: prefixing dot to executable: "./hello"
node 0: name = mck027, mpname = mck027, cpu = 1
wait_one_task_start: evt = 2, task 0 host mck027
All 1 task started.
hello from 0/2 hostname mck027 pid 30794 with 0 args:
hello from 1/2 hostname mck027 pid 30795 with 0 args:
wait_tasks: numspawned = 1, got evt 3 for tid 11 host mck027 status 0
I'm confused about why it tries to start two tasks in your case.
-- Pete