ulimit of programs run by mpiexec?

Aquarijen aquarijen at gmail.com
Wed Apr 26 18:35:03 EDT 2006


Hi,
I am the admin of a cluster where we use mpiexec, and other than
this one problem, we love it!
How do I change the ulimit of the programs I run using mpiexec, or how
do I get them to source my environment?  PBS seems to pick up the ulimit
that I have set in my .bashrc, but the programs launched by mpiexec
within PBS do not.  Let me see if I can explain better.

I notice that if I add ulimit -s 40000
to my .bashrc file in my shared home directory, it takes effect in the
PBS environment of the job.  To test this, I changed my .bashrc and
started an interactive job with:
$ qsub -I -l nodes=1,mem=2Gb
and then, once the job had assigned me a node, I checked with ulimit -s.
For example:

[2vt at b08l02 collin]$ qsub -I -l nodes=1,mem=2Gb
qsub: waiting for job 1244.b08l02.oic.ornl.gov to start
qsub: job 1244.b08l02.oic.ornl.gov ready

Prologue Initiated
Creating temporary directory /scratch/1244.b08l02.oic.ornl.gov on
node(s): b08n061.oic.ornl.gov
Prologue Complete
[2vt at b08n061 ~]$ ulimit -s
40000
[2vt at b08n061 ~]$ exit
qsub: job 1244.b08l02.oic.ornl.gov completed
[2vt at b08l02 collin]$

------------------
When I do the same thing (leave the ulimit setting in my .bashrc) and
submit a batch job, the limit shows as 40000 from within PBS, but not
from within the program run by mpiexec.

For example, I have a really simple shell script called "ulimit_command":
#!/bin/bash -l
echo `hostname`: `ulimit -s`

and from within my pbs submit script I do a:
pbsdsh /home/2vt/jenstests/collin/ulimit_command

BUT in a simple hello world MPI program, I call:
system("ulimit -s");
After compiling and running it, the output is correct for ulimit from
pbsdsh, but incorrect from the MPI program launched by mpiexec.  If I
run this same program WITHOUT using mpiexec, it gives the correct ulimit.
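
A variant of the ulimit_command check that might help narrow this down: a minimal sketch, assuming bash is available on the compute nodes, that prints both the soft and the hard stack limit. The function name show_stack_limits is only illustrative. If the hard limit seen under mpiexec turns out to be 10240 as well, the soft limit cannot be raised above it from inside the job.

```shell
#!/bin/bash
# Illustrative helper (not part of my actual scripts): report the soft and
# hard stack limits, in kB or "unlimited", for the current process.
show_stack_limits() {
    # uname -n prints the node name, much like hostname does
    echo "$(uname -n): soft=$(ulimit -S -s) hard=$(ulimit -H -s)"
}

show_stack_limits
```

Run through pbsdsh and again through mpiexec, this would show whether only the soft limit differs or the hard limit as well.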

My submit script is:
#!/bin/bash -l

#PBS -S /bin/bash
#PBS -V
#PBS -j eo
#PBS -N OUT-4
#PBS -q workq
#PBS -l walltime=00:00:30,nodes=4:ppn=1

pbsdsh /home/2vt/jenstests/collin/ulimit_command

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Node file: $PBS_NODEFILE :"
echo "-------------------"
cat $PBS_NODEFILE
echo "-------------------" NUM_PROCS=`/bin/awk 'END {print NR}' $PBS_NODEFILE`
echo "Running on $NUM_PROCS processors."
echo "Starting run at: `date`"
echo "-------------------"
mpiexec hello_mpi
/home/2vt/jenstests/collin/hello_mpi
echo "-------------------"
echo "Ending run at: `date`"

Then the output of the batch job is:

Prologue Initiated
Creating temporary directory /scratch/1284.b08l02.oic.ornl.gov on
node(s):Warning: Permanently added 'b07n036.oic.ornl.gov,172.16.3.38'
(RSA) to the list of known hosts.
 b07n036.oic.ornl.gov b07n027.oic.ornl.gov
Prologue Complete
b07n036.oic.ornl.gov: 40000
b07n027.oic.ornl.gov: 40000
b07n036.oic.ornl.gov: 40000
b07n027.oic.ornl.gov: 40000
Current working directory is /home/2vt/jenstests/collin
Node file: /var/spool/pbs/aux//1284.b08l02.oic.ornl.gov :
-------------------
b07n036.oic.ornl.gov
b07n036.oic.ornl.gov
b07n027.oic.ornl.gov
b07n027.oic.ornl.gov
------------------- NUM_PROCS=4
Running on  processors.
Starting run at: Wed Apr 26 18:32:21 EDT 2006
-------------------
10240
10240
10240
10240
Hello world from process 3 of 4
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
40000
Hello world from process 0 of 1
-------------------
Ending run at: Wed Apr 26 18:32:22 EDT 2006
Epilogue Initiated
Removing /scratch/1284.b08l02.oic.ornl.gov on node(s):
b07n036.oic.ornl.gov b07n027.oic.ornl.gov
Floaters flushed on node(s):
Epilogue Complete



Prologue Initiated
Creating temporary directory /scratch/1269.b08l02.oic.ornl.gov on
node(s): b08n061.oic.ornl.gov b07n047.oic.ornl.gov
b07n046.oic.ornl.gov
Prologue Complete
b08n061.oic.ornl.gov: 40000
b07n046.oic.ornl.gov: 40000
b07n047.oic.ornl.gov: 40000
b07n047.oic.ornl.gov: 40000
Current working directory is /home/2vt/jenstests/collin
Node file: /var/spool/pbs/aux//1269.b08l02.oic.ornl.gov :
-------------------
b08n061.oic.ornl.gov
b07n047.oic.ornl.gov
b07n047.oic.ornl.gov
b07n046.oic.ornl.gov
------------------- NUM_PROCS=4
Running on  processors.
Starting run at: Wed Apr 26 17:51:13 EDT 2006
-------------------
10240
10240
10240
10240
Hello world from process 2 of 4
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4
-------------------
Ending run at: Wed Apr 26 17:51:14 EDT 2006
Epilogue Initiated
Removing /scratch/1269.b08l02.oic.ornl.gov on node(s):
b08n061.oic.ornl.gov b07n047.oic.ornl.gov b07n046.oic.ornl.gov
Floaters flushed on node(s):
Epilogue Complete
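
One possible workaround, sketched here and not tested on this cluster: have mpiexec launch a small wrapper that sets the stack limit and then execs the real program. The function name run_with_stack_limit and the 2048 kB demo value are illustrative only. The sketch assumes the hard limit on the node is at least as large as the requested value; if it is not, the ulimit call fails and the command is not run.

```shell
#!/bin/bash
# Hypothetical helper: set the soft stack limit (in kB), then run a command.
run_with_stack_limit() {
    local limit_kb=$1
    shift
    (
        # Fails if limit_kb exceeds the hard limit; the command is then skipped.
        ulimit -S -s "$limit_kb" || exit 1
        # Replace the subshell with the command, which inherits the new limit.
        exec "$@"
    )
}

# Demo: the child shell reports the limit it inherited from the helper.
run_with_stack_limit 2048 sh -c 'ulimit -s'
```

In a job, the same idea would presumably be a standalone wrapper script given to mpiexec in place of hello_mpi, with 40000 as the limit, assuming mpiexec accepts a script as the executable.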

------------------------------------



Thank you for any help you can give!!!
-Jennifer
Admin, ORNL Institutional Cluster

