MPI errors - may be related to mpiexec

Prakash Velayutham prakash.velayutham at cchmc.org
Tue Nov 15 11:31:28 EST 2005


Another experiment shows that the SIGx error I am getting occurs only 
when I use mpiexec, not mpirun. Could someone please confirm this, or 
tell me otherwise? Of course, we don't want to use mpirun.
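
For reference, the number in the "p4_error: interrupt SIGx: 15" line quoted 
below is a signal number, as is the "sig 9" in the mom log. A minimal Python 
sketch for decoding them, assuming a Linux host (Python and the helper name 
signal_name are illustrative choices, not part of the setup described here):

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"


# 15 is taken from the p4_error line, 9 from the kill_task lines in the mom log.
for number in (15, 9):
    print(number, signal_name(number))
```

On Linux this should report SIGTERM for 15 and SIGKILL for 9, which would 
suggest the MPI process was asked to terminate from outside and did not 
crash on its own. This only names the signals; it does not show who sent them.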

Prakash

Prakash Velayutham wrote:

>Hi All,
>
>I am reposting this as I am not able to join the previous thread. 
>
>My setup is as follows:
>
>SuSE Pro 9.3 with mpich-1.2.7p1, mpiexec-0.80 (from OSC),
>torque-1.2.0p5.
>The head node is an Intel Xeon (not EM64T), where this software was compiled.
>
>The execution host is an Intel P3 system. Would that change anything?
>
>The error returned is:
>p0_7360:  p4_error: interrupt SIGx: 15
>
>I compiled mpich with
>./configure --with-comm=shared --with-device=ch_p4 --without-romio 
>-prefix=/usr/local/mpich-1.2.7p1 -rsh=/usr/bin/ssh --enable-sharedlib
>
>and mpiexec with
>./configure --prefix=/usr/local/mpiexec-0.80 
>--with-pbs=/usr/local/torque-1.2.0p5 --with-default-comm=mpich-p4 
>--with-mpicc=/usr/local/mpich-1.2.7p1/bin/mpicc 
>--with-mpif77=/usr/local/mpich-1.2.7p1/bin/mpif77
>
>Here is the output of tracejob:
>11/13/2005 13:39:31  S    enqueuing into users, state 1 hop 1
>11/13/2005 13:39:31  S    Job Queued at request of
>rjain at ribosome.cchmc.org, owner = rjain at ribosome.cchmc.org, job name =
>test3, queue = users
>11/13/2005 13:39:31  S    Job Modified at request of
>Scheduler at ribosome.cchmc.org
>11/13/2005 13:39:31  S    Job Run at request of
>Scheduler at ribosome.cchmc.org
>11/13/2005 13:39:31  A    queue=users
>11/13/2005 13:39:32  L    Job Run
>11/13/2005 13:39:32  A    user=rjain group=users jobname=test3
>queue=users ctime=1131907170 qtime=1131907171 etime=1131907171
>start=1131907172 exec_host=serine/0
>Resource_List.neednodes=1:ppn=1:serine
>                          Resource_List.nodect=1
>Resource_List.nodes=1:ppn=1:serine 
>11/13/2005 13:43:11  S    Exit_status=0 resources_used.cput=00:00:00
>resources_used.mem=31760kb resources_used.vmem=134964kb
>resources_used.walltime=00:03:39
>11/13/2005 13:43:11  S    dequeuing from users, state EXITING
>11/13/2005 13:43:11  A    user=rjain group=users jobname=test3
>queue=users ctime=1131907170 qtime=1131907171 etime=1131907171
>start=1131907172 exec_host=serine/0
>Resource_List.neednodes=1:ppn=1:serine
>                          Resource_List.nodect=1
>Resource_List.nodes=1:ppn=1:serine session=9800 end=1131907391
>Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=31760kb
>                          resources_used.vmem=134964kb
>resources_used.walltime=00:03:39
>
>And the mom log is:
>
>11/13/2005 13:38:39;0001;   pbs_mom;Job;TMomFinalizeJob3;job
>51025.ribosome.cchmc.org started, pid = 9800
>11/13/2005 13:38:39;0008;  
>pbs_mom;Job;51025.ribosome.cchmc.org;start_process: task started, tid 2,
>sid 9858, cmd /bin/sh
>11/13/2005 13:38:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job
>51026.ribosome.cchmc.org started, pid = 9904
>11/13/2005 13:42:10;0008;  
>pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2
>with sig 9
>11/13/2005 13:42:15;0008;  
>pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9869 task 2
>with sig 9
>11/13/2005 13:42:18;0008;  
>pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9895 task 2
>with sig 9
>11/13/2005 13:42:18;0080;  
>pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job
>51025.ribosome.cchmc.org task 2 terminated, sid 9858
>11/13/2005 13:42:18;0080;  
>pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job
>51025.ribosome.cchmc.org task 1 terminated, sid 9800
>11/13/2005 13:42:18;0008;  
>pbs_mom;Job;51025.ribosome.cchmc.org;Terminated
>
>Any suggestions please?
>
>Prakash
>
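
A quick consistency check on the tracejob accounting record quoted above: 
the start= and end= fields look like Unix epoch seconds, so their difference 
should equal the reported resources_used.walltime. A small Python sketch, 
assuming that interpretation (the function name format_walltime is 
hypothetical):

```python
def format_walltime(start_epoch: int, end_epoch: int) -> str:
    """Format the elapsed time between two epoch timestamps as HH:MM:SS."""
    elapsed = end_epoch - start_epoch
    hours, rest = divmod(elapsed, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Values copied from the accounting record: start=1131907172 end=1131907391
print(format_walltime(1131907172, 1131907391))
```

The difference is 219 seconds, which agrees with the walltime of 00:03:39 
in the record.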

