mpiexec - Torque issues

Pete Wyckoff pw at osc.edu
Mon Nov 28 10:07:24 EST 2005


Prakash.Velayutham at cchmc.org wrote on Thu, 24 Nov 2005 10:47 -0500:
> I installed OpenPBS again and compiled mpiexec 0.8 against that and
> everything works fine again. Would you have some time to get
> Torque-1.2.6 to work with mpiexec? I could help you in whatever way
> possible like testing etc. I would really love to get Torque with
> mpiexec working as newer Torque releases (starting with 2) have
> multi-server support which would be very useful.

Try enabling debugging in mpich, or look at your code to figure out
what it is doing that is causing it to exit.  Your problem is this
line:

> >> The error returned is:
> >> p0_7360:  p4_error: interrupt SIGx: 15
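
For reference, the 15 in that message appears to be a signal number,
which on POSIX systems is SIGTERM; the "sig 9" in the pbs_mom log
further down is SIGKILL.  A minimal sketch for decoding such numbers,
assuming a Linux or other POSIX host with Python available (the helper
name signal_name is only illustrative):

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name this host uses for a signal number."""
    # signal.Signals is the standard-library enum of signals known on
    # this platform; an unknown number raises ValueError.
    return signal.Signals(number).name


if __name__ == "__main__":
    for number in (15, 9):
        print(number, signal_name(number))
```

So the MPI task was asked to terminate from outside rather than
crashing on its own, which is worth keeping in mind when reading the
pbs_mom log.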

You might run mpiexec with "-v -v", but I don't think it will tell
you anything.  It starts the jobs and they run for almost four
minutes until one of your MPI tasks dies as above:

> >> pbs_mom;Job;51025.ribosome.cchmc.org;start_process: task started, tid 2, sid 9858, cmd /bin/sh
> >> 11/13/2005 13:38:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 51026.ribosome.cchmc.org started, pid = 9904
> >> 11/13/2005 13:42:10;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2 with sig 9
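
The gap between those two timestamps can be checked with a few lines
of Python.  This is only a sketch: it assumes the pbs_mom timestamp
format is month/day/year followed by a 24-hour time, as it appears in
the excerpt, and it uses the "started" line of job 51026 as a stand-in
for the start time, since the start_process line for 51025 is quoted
without its timestamp.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted pbs_mom log (assumed).
LOG_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    start_time = datetime.strptime(start, LOG_TIME_FORMAT)
    end_time = datetime.strptime(end, LOG_TIME_FORMAT)
    return (end_time - start_time).total_seconds()


if __name__ == "__main__":
    started = "11/13/2005 13:38:43"  # "job ... started" line
    killed = "11/13/2005 13:42:10"   # "kill_task" line
    print(elapsed_seconds(started, killed))  # 207.0
```

That is 207 seconds, so the tasks had been running for several
minutes before the kill.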

This is quite unlikely to be a problem with mpiexec, as those tend to
occur during the startup phase, not several minutes later.

		-- Pete

