segfault

Glen Beane beaneg at umcs.maine.edu
Fri Sep 5 16:03:49 EDT 2003


I swapped out the RAM and Myrinet card with no effect, so I swapped out
the whole node.  As I suspected, the problem then disappeared. These are
diskless nodes, so basically it has to be the processors or some
component on the motherboard.  Does anyone know of a hardware failure
that would cause mpiexec-controlled jobs to segfault during or right
after startup, while jobs started with ssh run fine? I have the feeling
something is starting to fail, and since it's not the RAM or the Myrinet
card, there isn't much else it could be other than the motherboard and
CPUs.  I guess I could swap the processors and motherboard separately to
find out where the problem is.


On Fri, 2003-09-05 at 10:20, Glen Beane wrote:
> I have a rather strange problem:  suddenly one node in my cluster
> started crashing jobs, and the error was always that the MPI tasks on
> that particular node had died with signal 11 (segfault).  This problem
> didn't happen with mpirun.ch_gm, only with mpiexec.  The strange thing
> is that all my nodes are diskless, so they all have the exact same
> setup, and no other node has this problem.  I've rebooted the node to
> reset the ramdisk image, done memory tests, and run jobs with
> mpirun.ch_gm, and no problems show up.  It seems really strange to me
> that this one node would be crashing jobs with mpiexec when all the
> other identical nodes have no problem.  This started about a week ago.
> I'm going to upgrade mpiexec later today.
> 
> Does anyone have any ideas?