Bug or strange results in runtests.pl

Anna Jonna Armannsdottir annaj at hi.is
Mon Nov 13 11:30:44 EST 2006


On Mon, 2006-11-13 at 07:54 -0800, Garrick Staples wrote:
> On Mon, Nov 13, 2006 at 03:41:57PM +0000, Anna Jonna Armannsdottir
> alleged:
> > =>> PBS: job killed: walltime 308 exceeded limit 300
> 
> Did I miss this line in the torqueusers thread?
Hi Garrick and many thanks for your fast response. 
I did not post it there. 

> This job ran out of walltime.  The job needs to request more?
The job had ample walltime and should never have exceeded 5 minutes. 

It appears that the job is put to sleep and then gets an abort signal
on node 0. Instead of reacting to it, the job seems to sleep on until
it is thrown out of the batch queue.
The signal should have woken the job, and with it all processes in the
job, so that they could react. Once awake, a process can respond to
the signal (abort in this case) and, in the general case, resume its
previous state after responding.

But I don't know, just a reasonable guess. :) 
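
A minimal sh sketch of the wake-up behaviour described above. It is
only an illustration and not the mpiexec "hello" test program: the
30-second sleep, the USR1 signal and the variable names are all
assumptions chosen for the example.

```shell
#!/bin/sh
# Hypothetical illustration only, not the mpiexec "hello" test.
# A shell blocked in wait is woken by a trapped signal, runs its
# handler, and then carries on instead of sleeping until the end.

woken=no
trap 'woken=yes' USR1        # handler: just record that we were woken

sleep 30 &                   # stands in for "the job is put to sleep"
sleeper=$!

# Deliver the signal to this shell after one second.
( sleep 1; kill -USR1 $$ ) &

# wait returns early, with a status above 128, when a trapped
# signal arrives; the handler runs right after that.
wait "$sleeper"
status=$?

kill "$sleeper" 2>/dev/null  # clean up the leftover sleep
echo "woken=$woken status=$status"
```

If the signal is handled as expected, the script finishes after about
one second rather than thirty, which is the reaction the failing job
does not seem to show.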

Have a look at the qsub script 
(I put in the backslash for readability):
#!/bin/sh
#PBS -l nodes=4:ppn=1
#PBS -l walltime=5:00
#PBS -l cput=20:00
#PBS -j oe
#PBS -o testqo.8461.34
cd /home/user/mpiexec-0.81
echo '[1+pd5>x]sx0lxxq' | dc | ./mpiexec --comm=pmi  \
hello -sleep -abort 0 > testho.8461.34 2>&1

-- 
Kindest Regards, Anna Jonna Ármannsdóttir,
Unix System Administration, Computing Services, 
University of Iceland.


