mpiexec - Torque issues
Prakash Velayutham
prakash.velayutham at cchmc.org
Tue Nov 29 11:02:11 EST 2005
Pete Wyckoff wrote:
> Prakash.Velayutham at cchmc.org wrote on Thu, 24 Nov 2005 10:47 -0500:
>
>> I installed OpenPBS again and compiled mpiexec 0.8 against that, and
>> everything works fine again. Would you have some time to get
>> Torque-1.2.6 to work with mpiexec? I could help in whatever way
>> possible, such as testing. I would really love to get Torque working
>> with mpiexec, as newer Torque releases (starting with 2) have
>> multi-server support, which would be very useful.
>>
>
> Try enabling debugging in mpich, or look at your code to figure out
> what it is doing that is causing it to exit. Your problem is this
> line:
>
>
>>>> The error returned is:
>>>> p0_7360: p4_error: interrupt SIGx: 15
>>>>
>
> You might run mpiexec with "-v -v", but I don't think it will tell
> you anything. It starts the jobs and they run for almost four
> minutes until one of your MPI tasks dies as above:
>
>
>>>> pbs_mom;Job;51025.ribosome.cchmc.org;start_process: task started, tid 2,
>>>> sid 9858, cmd /bin/sh
>>>> 11/13/2005 13:38:43;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>>> 51026.ribosome.cchmc.org started, pid = 9904
>>>> 11/13/2005 13:42:10;0008;
>>>> pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2
>>>> with sig 9
>>>>
>
> This is quite unlikely to be a problem with mpiexec, as those tend to
> occur during the startup phase, not many minutes later.
>
> -- Pete
Pete,
I will get on this as soon as possible. I just have one question: how is
it that everything works fine with OpenPBS but not with Torque? If it were
a problem with the MPI code, I would expect it to show the same symptoms
under OpenPBS as well.
Thanks,
Prakash
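
For reference, the signal numbers and timestamps in the quoted logs can be
decoded with a short script. This is only a minimal sketch: it assumes a
POSIX system where signals 15 and 9 have their conventional meanings, and it
assumes the 13:38:43 start line (which belongs to job 51026) is
representative of when the tasks started, since the timestamp on the 51025
start line is not shown in the quoted log.

```python
import signal
from datetime import datetime

# Signal numbers copied from the quoted p4_error and pbs_mom kill_task lines.
P4_ERROR_SIGNAL = 15
KILL_TASK_SIGNAL = 9


def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    return signal.Signals(number).name


# Timestamps copied from the quoted pbs_mom log lines.
# Assumption: the 13:38:43 line (job 51026) approximates the task start time.
LOG_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"
job_started = datetime.strptime("11/13/2005 13:38:43", LOG_TIME_FORMAT)
task_killed = datetime.strptime("11/13/2005 13:42:10", LOG_TIME_FORMAT)
elapsed_seconds = int((task_killed - job_started).total_seconds())

print(signal_name(P4_ERROR_SIGNAL))   # signal reported by p4_error
print(signal_name(KILL_TASK_SIGNAL))  # signal pbs_mom used in kill_task
print(elapsed_seconds, "seconds between the start line and the kill line")
```

Under those assumptions, the MPI task received a termination signal and
pbs_mom then killed the remaining task, a few minutes after startup rather
than during it.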