mpiexec: Warning: tasks 0-168, 172-175 died with signal 4 (Illegal instruction)

Pete Wyckoff pw at osc.edu
Wed Jan 24 15:18:31 EST 2007


bmecklenburg at colsa.com wrote on Wed, 24 Jan 2007 13:02 -0600:
> I have some questions about what I am doing wrong in the setup or
> implementation of running some PBS jobs.  I am trying to combine two
> clusters we have.  One is a 128-node IBM Open Power 5 cluster (marvin)
> running SLES 9 and the other is a 128-node Apple Xserve cluster 9 (otis).
> The IBM cluster has remained pretty much intact, and we added the Apple
> cluster to it by putting OpenSuse 10.2 on its nodes.

Most of this sounds like a question for the torque list, so be sure
to send the same report there.  When you do, say whether these two
types of machines are really the same architecture and can therefore
run the same binaries.
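
One way to answer the architecture question is to collect the output
of "uname -m" from every node and group the hosts by what they report.
The sketch below does only the grouping step.  The function names, the
host names, and the architecture strings in the sample are all
hypothetical; how you gather the strings (ssh, pbsdsh, or anything
else) is left open.

```python
from collections import defaultdict


def group_by_arch(reports):
    """Group host names by the architecture string each host reported.

    `reports` maps a host name to the text that `uname -m` printed there.
    """
    groups = defaultdict(list)
    for host, arch in reports.items():
        groups[arch.strip()].append(host)
    return {arch: sorted(hosts) for arch, hosts in groups.items()}


def is_homogeneous(reports):
    """True when every host reported the same architecture string."""
    return len(group_by_arch(reports)) <= 1


# Hypothetical sample data, invented for illustration only.
sample = {"nodeA": "ppc64\n", "nodeB": "ppc64\n", "nodeC": "x86_64\n"}

if __name__ == "__main__":
    for arch, hosts in sorted(group_by_arch(sample).items()):
        print(arch, len(hosts), " ".join(hosts))
```

If more than one group comes back, a single binary built on one kind
of node should not be expected to run on the other.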

> When I try to submit using either mpirun or mpiexec, the maui log gives this
> error:
> 01/24 11:21:58 INFO:     job '1661' Priority:        1
> 01/24 11:21:58 INFO:     job '1661' Priority:        1
> 01/24 11:21:58 MResDestroy(1661)
> 01/24 11:21:58 MResChargeAllocation(1661,2)
> 01/24 11:21:58 INFO:     256 feasible tasks found for job 1661:0 in
> partition DEFAULT (256 Needed)
> 01/24 11:21:58 ALERT:    inadequate tasks to allocate to job 1661:0 (176 <
> 256)
> 01/24 11:21:58 ERROR:    cannot allocate nodes to job '1661' in partition
> DEFAULT

This is certainly the crux of your problem: maui reports 256 feasible
tasks for the job but then says it has only 176 to allocate.  I do not
have an answer for why, though.
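
If you want to track this while changing the configuration, the ALERT
line can be pulled apart mechanically.  This is a minimal sketch that
assumes the line format is exactly what appears in the log quoted
above (rejoined here, since the mail wrapped it).  Reading the first
number as "tasks that could be allocated" and the second as "tasks
needed" is my interpretation of the message, and the names ALERT_RE
and parse_alert are made up for the example.

```python
import re

# Pattern taken from the ALERT line quoted above; the field names are assumptions.
ALERT_RE = re.compile(
    r"inadequate tasks to allocate to job (?P<job>\S+) "
    r"\((?P<available>\d+) < (?P<needed>\d+)\)"
)


def parse_alert(line):
    """Return (job, available, needed) from a maui ALERT line, or None."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    return (
        match.group("job"),
        int(match.group("available")),
        int(match.group("needed")),
    )


line = (
    "01/24 11:21:58 ALERT:    inadequate tasks to allocate to job 1661:0 "
    "(176 < 256)"
)
job, available, needed = parse_alert(line)
shortfall = needed - available

if __name__ == "__main__":
    print(job, available, needed, shortfall)
```

For the line above the shortfall is 80 tasks.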

> I have changed back and forth using mpirun and mpiexec.
> 
> When using mpiexec, the job sits in the queue and when I try to qrun the job
> I get the following errors

Choosing mpiexec or mpirun does not matter for getting the job out
of the queue.  Your problem lies elsewhere.

> number of processors =   256 186  r08n38
> number of processors =   256 151  r09n13
> number of processors =   256 118  r09n29
> MX:r08n26:Got a NACK:req status 8:Remote endpoint is closed
>         type (8): connect
>         state (0x0):
>         requeued: 1 (timeout=510000ms)
>         dest: 00:60:dd:48:1a:b4 (r10n15:0)
>         partner: peer_index=22, endpoint=1, seqnum=0x0
>         connect_seq: 0x1
> 
> This continues on for many more of the compute nodes until it comes down to
> this error:
> MX:Aborting
> mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4
> (Illegal instruction).
> mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.
> mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).
> 
> I was not sure whether the mpiexec warnings were a result of MX aborting
> because the job failed.

Yeah, everything died.  Mpiexec is just trying to summarize how
terrible it all was.
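
To see how the 256 tasks split across those three outcomes, the task
ranges in the warnings can be expanded and counted.  This is only a
sketch: the regular expression is fitted to the three warning lines
quoted above (rejoined where the mail wrapped them), not to any
documented mpiexec output format, and the helper names are invented.

```python
import re

# Fitted to the warnings quoted above; not a documented mpiexec format.
WARNING_RE = re.compile(
    r"mpiexec: Warning: tasks (?P<tasks>[\d,\- ]+) "
    r"(?P<outcome>died with signal \d+|exited with status \d+)"
)


def expand(spec):
    """Expand a range list such as '0-2,5' into [0, 1, 2, 5]."""
    ids = []
    for part in spec.replace(" ", "").split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        ids.extend(range(int(low), int(high or low) + 1))
    return ids


def tally(text):
    """Count how many tasks ended with each outcome named in the warnings."""
    counts = {}
    for match in WARNING_RE.finditer(text):
        outcome = match.group("outcome")
        counts[outcome] = counts.get(outcome, 0) + len(expand(match.group("tasks")))
    return counts


log = (
    "mpiexec: Warning: tasks 0-173,176-179,184-192,194-197 died with signal 4"
    " (Illegal instruction).\n"
    "mpiexec: Warning: tasks 174-175,180-181,193,198-255 exited with status 1.\n"
    "mpiexec: Warning: tasks 182-183 died with signal 15 (Terminated).\n"
)

if __name__ == "__main__":
    for outcome, count in tally(log).items():
        print(count, outcome)
```

On the warnings above this gives 191 tasks killed by signal 4, 63
exiting with status 1, and 2 killed by signal 15, which accounts for
all 256.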

		-- Pete

