mpiexec: Warning: tasks 0-168, 172-175 died with signal 4 (Illegal instruction)
Pete Wyckoff
pw at osc.edu
Wed Jan 24 15:18:31 EST 2007
bmecklenburg at colsa.com wrote on Wed, 24 Jan 2007 13:02 -0600:
> I have some questions on what I am doing wrong in the setup or
> implementation of running some pbs jobs. I am trying to combine two
> clusters we have. One is a 128-node IBM Open Power 5 cluster (marvin)
> running SLES 9 and the other is a 128-node Apple Xserve cluster 9 (otis).
> The IBM cluster has pretty much remained intact and we added the Apple
> cluster to it by putting OpenSuse 10.2 on them.
Most of this sounds like a job for the torque list. Be sure to send
the same thing to them. You may want to point out if these two
types of machines are really the same architecture and hence run the
same binaries, or not.
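One rough way to check whether the two kinds of nodes report the same
processor type is to compare a field from each node's /proc/cpuinfo. The
sketch below is only an illustration: it assumes the text of /proc/cpuinfo
has already been collected from each node by some other means, that the
field of interest is named "cpu" (the name differs between platforms), and
the node names and model strings in the sample are hypothetical
placeholders, not real output.

```python
from collections import defaultdict


def cpu_models(cpuinfo_text, field="cpu"):
    """Return the distinct values of one field in /proc/cpuinfo-style text.

    The field name "cpu" is an assumption; pass a different name if your
    platform labels the processor model differently.
    """
    models = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == field:
            models.add(value.strip())
    return models


def group_nodes_by_cpu(cpuinfo_by_node, field="cpu"):
    """Group node names by the set of CPU models each node reports."""
    groups = defaultdict(list)
    for node, text in sorted(cpuinfo_by_node.items()):
        groups[frozenset(cpu_models(text, field))].append(node)
    return dict(groups)


# Hypothetical input: placeholder node names and model strings.
sample = {
    "nodeA": "processor : 0\ncpu : model-x\n",
    "nodeB": "processor : 0\ncpu : model-y\n",
    "nodeC": "processor : 0\ncpu : model-x\n",
}

groups = group_nodes_by_cpu(sample)
for models, nodes in groups.items():
    print(sorted(models), nodes)
```

If the nodes fall into more than one group, a binary built for one group
may use instructions the other does not implement, which is one common
cause of signal 4 (Illegal instruction).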
> When I try to submit using either mpirun or mpiexec, the maui log gives this
> error:
> 01/24 11:21:58 INFO: job '1661' Priority: 1
> 01/24 11:21:58 INFO: job '1661' Priority: 1
> 01/24 11:21:58 MResDestroy(1661)
> 01/24 11:21:58 MResChargeAllocation(1661,2)
> 01/24 11:21:58 INFO: 256 feasible tasks found for job 1661:0 in
> partition DEFAULT (256 Needed)
> 01/24 11:21:58 ALERT: inadequate tasks to allocate to job 1661:0 (176 <
> 256)
> 01/24 11:21:58 ERROR: cannot allocate nodes to job '1661' in partition
> DEFAULT
This is certainly the crux of your problem. I do not have an answer
though.
> I have changed back and forth using mpirun and mpiexec.
>
> When using mpiexec, the job sits in the queue and when I try to qrun the job
> I get the following errors
Choosing mpiexec or mpirun does not matter for getting the job out
of the queue. Your problem lies elsewhere.
> number of processors = 256 186 r08n38
> number of processors = 256 151 r09n13
> number of processors = 256 118 r09n29
> MX:r08n26:Got a NACK:req status 8:Remote endpoint is closed
> type (8): connect
> state (0x0):
> requeued: 1 (timeout=510000ms)
> dest: 00:60:dd:48:1a:b4 (r10n15:0)
> partner: peer_index=22, endpoint=1, seqnum=0x0
> connect_seq: 0x1
>
> This continues on for many more of the compute nodes until it comes down to
> this error:
> MX:Aborting
> mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4
> (Illegal instruction).
> mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.
> mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).
>
> I was not sure if the mpiexec warning was a result of mx aborting because
> the job failed.
Yeah, everything died. Mpiexec is just trying to summarize how
terrible it all was.
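To see that the three warning lines really do account for every task, the
ranges can be expanded and counted. This is a small sketch, not part of
mpiexec itself: the function names are made up for illustration, and the
warning text is copied from the quoted output above with the wrapped lines
rejoined.

```python
import re
import signal

# The three mpiexec warning lines quoted above, rejoined onto single lines.
WARNINGS = [
    "mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4 (Illegal instruction).",
    "mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.",
    "mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).",
]


def expand_ranges(spec):
    """Expand a task list such as '0-2,5' into the set {0, 1, 2, 5}."""
    tasks = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        tasks.update(range(int(lo), int(hi or lo) + 1))
    return tasks


def summarize(lines):
    """Map each outcome ('signal 4', 'status 1', ...) to its set of task ranks."""
    pattern = re.compile(
        r"tasks ([\d,-]+) (?:died with|exited with) (signal \d+|status \d+)"
    )
    summary = {}
    for line in lines:
        match = pattern.search(line)
        if match:
            ranks = expand_ranges(match.group(1))
            summary.setdefault(match.group(2), set()).update(ranks)
    return summary


summary = summarize(WARNINGS)
for outcome, tasks in summary.items():
    print(outcome, len(tasks))
print("total", sum(len(tasks) for tasks in summary.values()))

# Signal numbers 4 and 15 correspond to these standard names.
print(signal.Signals(4).name, signal.Signals(15).name)
```

The three groups are disjoint and together cover ranks 0 through 255, so
all 256 tasks are accounted for, with the large majority dying on the
illegal instruction.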
-- Pete