MPI errors - may be related to mpiexec

Prakash Velayutham Prakash.Velayutham at cchmc.org
Sun Nov 13 15:14:23 EST 2005


Hi All,

I am reposting this as I am not able to join the previous thread. 

My setup is as follows:

SuSE Pro 9.3 with mpich-1.2.7p1, mpiexec-0.80 (from OSC), and
torque-1.2.0p5.
The head node is an Intel Xeon (not EM64T), where all of this software
was compiled.

The execution host is an Intel P3 system. Would that change anything?

The error returned is:
p0_7360:  p4_error: interrupt SIGx: 15
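
For reference, a minimal sketch for decoding the signal numbers that show up in this report, assuming the 15 in the p4_error line and the 9 in the mom log further down are ordinary POSIX signal numbers. The helper name signal_name is hypothetical and is not part of mpich, mpiexec or torque:

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


if __name__ == "__main__":
    # 15 comes from the p4_error line, 9 from the kill_task lines in the mom log.
    for number in (15, 9):
        print(number, signal_name(number))
```

On Linux this should report SIGTERM for 15 and SIGKILL for 9, which would suggest the process was asked to terminate from outside rather than crashing on its own.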

I compiled mpich with
./configure --with-comm=shared --with-device=ch_p4 --without-romio 
-prefix=/usr/local/mpich-1.2.7p1 -rsh=/usr/bin/ssh --enable-sharedlib

and mpiexec with
./configure --prefix=/usr/local/mpiexec-0.80 
--with-pbs=/usr/local/torque-1.2.0p5 --with-default-comm=mpich-p4 
--with-mpicc=/usr/local/mpich-1.2.7p1/bin/mpicc 
--with-mpif77=/usr/local/mpich-1.2.7p1/bin/mpif77

Here is the output of tracejob:
11/13/2005 13:39:31  S    enqueuing into users, state 1 hop 1
11/13/2005 13:39:31  S    Job Queued at request of rjain at ribosome.cchmc.org, owner = rjain at ribosome.cchmc.org, job name = test3, queue = users
11/13/2005 13:39:31  S    Job Modified at request of Scheduler at ribosome.cchmc.org
11/13/2005 13:39:31  S    Job Run at request of Scheduler at ribosome.cchmc.org
11/13/2005 13:39:31  A    queue=users
11/13/2005 13:39:32  L    Job Run
11/13/2005 13:39:32  A    user=rjain group=users jobname=test3 queue=users ctime=1131907170 qtime=1131907171 etime=1131907171 start=1131907172 exec_host=serine/0 Resource_List.neednodes=1:ppn=1:serine Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:serine
11/13/2005 13:43:11  S    Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=31760kb resources_used.vmem=134964kb resources_used.walltime=00:03:39
11/13/2005 13:43:11  S    dequeuing from users, state EXITING
11/13/2005 13:43:11  A    user=rjain group=users jobname=test3 queue=users ctime=1131907170 qtime=1131907171 etime=1131907171 start=1131907172 exec_host=serine/0 Resource_List.neednodes=1:ppn=1:serine Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:serine session=9800 end=1131907391 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=31760kb resources_used.vmem=134964kb resources_used.walltime=00:03:39
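
As a quick sanity check on the accounting record, the start and end epoch values can be compared against the reported walltime. This is only a sketch: the two constants are copied from the tracejob output above, and format_walltime is a hypothetical helper, not a torque tool:

```python
def format_walltime(seconds: int) -> str:
    """Format a duration in seconds as HH:MM:SS, the way walltime is printed."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Values copied from the accounting ("A") record in the tracejob output.
START = 1131907172
END = 1131907391

if __name__ == "__main__":
    print(format_walltime(END - START))  # prints 00:03:39
```

The difference is 219 seconds, which agrees with resources_used.walltime=00:03:39, so the job ran for the whole interval before it was terminated.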

And the mom log is:

11/13/2005 13:38:39;0001;   pbs_mom;Job;TMomFinalizeJob3;job 51025.ribosome.cchmc.org started, pid = 9800
11/13/2005 13:38:39;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;start_process: task started, tid 2, sid 9858, cmd /bin/sh
11/13/2005 13:38:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 51026.ribosome.cchmc.org started, pid = 9904
11/13/2005 13:42:10;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2 with sig 9
11/13/2005 13:42:15;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9869 task 2 with sig 9
11/13/2005 13:42:18;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9895 task 2 with sig 9
11/13/2005 13:42:18;0080;   pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job 51025.ribosome.cchmc.org task 2 terminated, sid 9858
11/13/2005 13:42:18;0080;   pbs_mom;Job;51025.ribosome.cchmc.org;scan_for_terminated: job 51025.ribosome.cchmc.org task 1 terminated, sid 9800
11/13/2005 13:42:18;0008;   pbs_mom;Job;51025.ribosome.cchmc.org;Terminated
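
To pull the kill events out of a longer mom log, something like the sketch below may help. It assumes the kill_task wording shown in the excerpt above, and find_kills is a hypothetical helper rather than anything shipped with torque. Whitespace is flattened first, so it does not matter whether the mail client wrapped the records:

```python
import re
from typing import List, Tuple

# Pattern based on the kill_task lines in the excerpt above (an assumption
# about the message format, not a documented torque interface).
KILL_RE = re.compile(r"kill_task: killing pid (\d+) task (\d+) with sig (\d+)")


def find_kills(log_text: str) -> List[Tuple[int, int, int]]:
    """Return (pid, task, signal) for every kill_task record in the text."""
    flat = " ".join(log_text.split())  # undo any line wrapping
    return [
        (int(m.group(1)), int(m.group(2)), int(m.group(3)))
        for m in KILL_RE.finditer(flat)
    ]


SAMPLE = """\
11/13/2005 13:42:10;0008;
pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9868 task 2
with sig 9
11/13/2005 13:42:15;0008;
pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9869 task 2
with sig 9
11/13/2005 13:42:18;0008;
pbs_mom;Job;51025.ribosome.cchmc.org;kill_task: killing pid 9895 task 2
with sig 9
"""

if __name__ == "__main__":
    for pid, task, sig in find_kills(SAMPLE):
        print(f"pid {pid} (task {task}) killed with signal {sig}")
```

For this excerpt it reports three processes (9868, 9869 and 9895), all in task 2 and all killed with signal 9.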

Any suggestions please?

Prakash

