mpiexec and tm fundamentals

Joshua Bernstein jbernstein at penguincomputing.com
Thu Oct 4 16:40:44 EDT 2007


> So, if I use mpiexec a la:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> mpiexec -comm none ./mpijob
> ---
> 
> The jobs again start properly on the nodes (albeit a bit slower), and 
> then when I do a qdel, the processes get properly cleaned off the nodes. 
> The trouble here is that the job still shows up in the TORQUE queue 
> marked as running. The only way to clean up this job is to remove its 
> entries from $PBS_HOME/server_priv/job.
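
For anyone following along, the elided BEOWULF_JOB_MAP step in the script 
above can be sketched roughly as follows. This is only an illustration, not 
the exact code from my job script: it assumes the nodefile lists one 
hostname per line in the n0, n1, ... style and that BEOWULF_JOB_MAP wants a 
colon-separated list of node numbers, one per rank. The function name 
nodefile_to_job_map is made up for this example.

```shell
# Hypothetical helper: convert a nodefile (one host per line, e.g. n0, n1)
# into a colon-separated list of node numbers such as 0:0:1.
# Assumption: each hostname contains its node number as the first run of digits.
nodefile_to_job_map() {
    # Keep only the first run of digits on each line, then join lines with ":".
    sed -e 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/' "$1" | paste -s -d: -
}

# Inside a PBS job, PBS_NODEFILE points at the list of allocated hosts.
if [ -n "${PBS_NODEFILE:-}" ]; then
    BEOWULF_JOB_MAP=$(nodefile_to_job_map "$PBS_NODEFILE")
    export BEOWULF_JOB_MAP
fi
```

With a nodefile containing n0, n0, n1 this would set BEOWULF_JOB_MAP to 
0:0:1. Hostnames that do not follow that pattern (the master node, for 
instance) would need separate handling.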

So now these jobs get stuck in the TORQUE queue. pbs_mom reports:

10/04/2007 12:51:31;0001;   pbs_mom;Job;1138.goldstar.penguincomputing.com;cannot tm_reply to task 1
10/04/2007 12:51:32;0100;   pbs_mom;Req;;Type SignalJob request received from PBS_Server at .-1, sock=12
10/04/2007 12:51:32;0001;   pbs_mom;Job;1138.goldstar.penguincomputing.com;cannot tm_reply to task 1
10/04/2007 12:51:32;0001;   pbs_mom;Job;1138.goldstar.penguincomputing.com;cannot tm_reply to task 1
10/04/2007 12:52:08;0100;   pbs_mom;Req;;Type StatusJob request received from PBS_Server at .-1, sock=10
10/04/2007 12:52:10;0008;   pbs_mom;Job;1138.goldstar.penguincomputing.com;job was terminated

Notice the "cannot tm_reply to task 1" messages. Perhaps this is why the 
jobs are getting stuck in the TORQUE queue?

What does this message mean?

-Josh

