MPI errors - may be related to mpiexec
Prakash Velayutham
prakash.velayutham at cchmc.org
Thu Nov 17 10:16:14 EST 2005
Hi,
I installed good old OpenPBS and compiled mpiexec 0.8 against it, and
everything works fine again. Pete, would you have some time to suggest
how I can get Torque to work with mpiexec? I will help in whatever way
I can. I would really love to get Torque working with mpiexec, as newer
Torque releases (starting with version 2) have multi-server support,
which would be very useful.
Thanks,
Prakash
Prakash Velayutham wrote:
> Another experiment shows that the SIGx error I am getting only
> occurs when I use mpiexec and not mpirun. Could someone please
> confirm this or otherwise? But of course we don't want to use mpirun.
>
> Prakash
>
> Prakash Velayutham wrote:
>
>> Hi All,
>>
>> I am reposting this as I am not able to join the previous thread.
>> My setup is as follows:
>>
>> SuSE Pro 9.3 with mpich-1.2.7p1, mpiexec-0.80 (from OSC),
>> torque-1.2.0p5.
>> Head node is an Intel Xeon (not EM64T), where this software was compiled.
>>
>> The execution host is an Intel P3 system. Would that change anything?
>>
>> The error returned is:
>> p0_7360: p4_error: interrupt SIGx: 15
>>
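The number after "SIGx" in the p4_error line is a plain signal number. As a minimal sketch, assuming a Linux host with Python available (the helper name signal_name is only illustrative and not part of mpich or mpiexec), the number can be mapped to its symbolic name with the standard signal module:

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


# The p4_error line above reports "interrupt SIGx: 15".
print(signal_name(15))  # SIGTERM on Linux
```

This only identifies which signal the process received. It does not say who sent it.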
>> I compiled mpich with
>> ./configure --with-comm=shared --with-device=ch_p4 --without-romio
>> -prefix=/usr/local/mpich-1.2.7p1 -rsh=/usr/bin/ssh --enable-sharedlib
>>
>> and mpiexec with
>> ./configure --prefix=/usr/local/mpiexec-0.80
>> --with-pbs=/usr/local/torque-1.2.0p5 --with-default-comm=mpich-p4
>> --with-mpicc=/usr/local/mpich-1.2.7p1/bin/mpicc
>> --with-mpif77=/usr/local/mpich-1.2.7p1/bin/mpif77
>>
>> Here is the output of tracejob:
>> 11/13/2005 13:39:31 S enqueuing into users, state 1 hop 1
>> 11/13/2005 13:39:31 S Job Queued at request of
>> rjain at ribosome.cchmc.org, owner = rjain at ribosome.cchmc.org, job name =
>> test3, queue = users
>> 11/13/2005 13:39:31 S Job Modified at request of
>> Scheduler at ribosome.cchmc.org
>> 11/13/2005 13:39:31 S Job Run at request of
>> Scheduler at ribosome.cchmc.org
>> 11/13/2005 13:39:31 A queue=users
>> 11/13/2005 13:39:32 L Job Run
>> 11/13/2005 13:39:32 A user=rjain group=users jobname=test3
>> queue=users ctime=1131907170 qtime=1131907171 etime=1131907171
>> start=1131907172 exec_host=serine/0
>> Resource_List.neednodes=1:ppn=1:serine
>> Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1:serine
>> 11/13/2005 13:43:11 S Exit_status=0 resources_used.cput=00:00:00
>> resources_used.mem=31760kb resources_used.vmem=134964kb
>> resources_used.walltime=00:03:39
>> 11/13/2005 13:43:11 S dequeuing from users, state EXITING
>> 11/13/2005 13:43:11 A user=rjain group=users jobname=test3
>> queue=users ctime=1131907170 qtime=1131907171 etime=1131907171
>> start=1131907172 exec_host=serine/0
>> Resource_List.neednodes=1:ppn=1:serine
>> Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=1:serine session=9800 end=1131907391
>> Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=31760kb
>> resources_used.vmem=134964kb
>> resources_used.walltime=00:03:39
>>
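The start= and end= fields in the accounting record are Unix epoch seconds, so the reported walltime can be cross-checked with simple arithmetic. A small sketch using the two values from the tracejob output above (the function name format_walltime is hypothetical, chosen for this example):

```python
def format_walltime(seconds: int) -> str:
    """Format a duration in seconds as HH:MM:SS, as in resources_used.walltime."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


start = 1131907172  # start= field from the accounting record
end = 1131907391    # end= field from the accounting record

elapsed = end - start
print(elapsed, format_walltime(elapsed))  # 219 00:03:39
```

The result agrees with resources_used.walltime=00:03:39, so the job ran for about three and a half minutes before it ended.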
>> And the mom log is:
>>
>> 11/13/2005 13:38:39;0001; pbs_mom;Job;TMomFinalizeJob3;job
>> 51025.ribosome.cchmc.org started, pid = 9800
>> 11/13/2005 13:38:39;0008;
>> pbs_mom;Job;51025.ribosome.cchmc.org;start_process: task started, tid 2,
>> sid 9858, cmd /bin/sh
>> 11/13/2005 13:38:43;0001; pbs_mom;Job;TMomFinalizeJob3;job
>> 51026.ribosome.cchmc.org started, pid = 9904
>> 11/13/2005 13:42:10;0008;
>> pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2
>> with sig 9
>> 11/13/2005 13:42:15;0008;
>> pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9869 task 2
>> with sig 9
>> 11/13/2005 13:42:18;0008;
>> pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9895 task 2
>> with sig 9
>> 11/13/2005 13:42:18;0080;
>> pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job
>> 51025.ribosome.cchmc.org task 2 terminated, sid 9858
>> 11/13/2005 13:42:18;0080;
>> pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job
>> 51025.ribosome.cchmc.org task 1 terminated, sid 9800
>> 11/13/2005 13:42:18;0008;
>> pbs_mom;Job;51025.ribosome.cchmc.org;Terminated
>>
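To see at a glance which processes pbs_mom killed and with which signal, the kill_task entries can be pulled out of the mom log. This is a rough sketch: the regular expression is based only on the three kill_task lines quoted above (rejoined here after the mail wrapping) and is an assumption, not a description of the general pbs_mom log format.

```python
import re
import signal

# kill_task lines from the mom log above, rejoined onto single lines.
MOM_LOG = """\
11/13/2005 13:42:10;0008; pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2 with sig 9
11/13/2005 13:42:15;0008; pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9869 task 2 with sig 9
11/13/2005 13:42:18;0008; pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9895 task 2 with sig 9
"""

# Pattern assumed from the lines in this post only.
KILL_RE = re.compile(r"kill_task: killing pid (\d+) task (\d+) with sig (\d+)")


def kill_events(log_text: str) -> list:
    """Return (pid, task, signal name) for every kill_task entry found."""
    events = []
    for match in KILL_RE.finditer(log_text):
        pid, task, sig = (int(group) for group in match.groups())
        events.append((pid, task, signal.Signals(sig).name))
    return events


for pid, task, name in kill_events(MOM_LOG):
    print(pid, task, name)
```

On Linux, signal 9 is SIGKILL, so all three entries show pbs_mom forcibly killing processes that belong to task 2 of job 51025.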
>> Any suggestions please?
>>
>> Prakash
>