segfault
Glen Beane
beaneg at umcs.maine.edu
Fri Sep 5 16:03:49 EDT 2003
I swapped out the RAM and the Myrinet card with no effect, so I swapped
out the whole node. As I suspected, the problem then disappeared. These
are diskless nodes, so basically it has to be the processors or some
component on the motherboard. Does anyone know of a hardware failure
that would cause mpiexec-controlled jobs to segfault during or right
after startup, while jobs started with ssh run fine? I have the feeling
something is starting to go, and since it's not RAM or Myrinet there
isn't much else it could be other than the mobo and CPUs. I guess I
could swap the processors and motherboard separately to find out where
the problem is.
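
In case it helps anyone check the symptom independently of mpiexec, here
is a minimal Python sketch of how a parent process can tell that a child
died from signal 11. It is only an illustration, not how mpiexec itself
reports task deaths: the names describe_exit and run_and_describe are
made up for this example, and the crashing child is a stand-in that
sends itself SIGSEGV. It assumes a POSIX system, where subprocess
reports death by signal N as return code -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as returncode -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run argv to completion and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Hypothetical stand-in for a crashing task: the child sends itself SIGSEGV.
    crasher = [sys.executable, "-c",
               "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    print(run_and_describe(crasher))
```

Running the same check on a suspect node and on a known-good node, with
a real test binary in place of the stand-in, would at least show whether
the crash depends on how the job is launched.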
On Fri, 2003-09-05 at 10:20, Glen Beane wrote:
> I have a rather strange problem: suddenly one node in my cluster
> started crashing jobs, and the error was always that the MPI tasks on
> that particular node had died with signal 11 (segfault). This problem
> didn't happen with mpirun.ch_gm, only with mpiexec. The strange thing
> is that all my nodes are diskless, so they all have the same exact
> setup, and no other node has this problem. I've rebooted the node to
> reset the ramdisk image, done memory tests, run jobs with mpirun.ch_gm,
> and no problems show up. It seems really strange to me that this one
> node would be crashing jobs with mpiexec when all the other identical
> nodes have no problem. This started about a week ago. I'm going to
> upgrade mpiexec later today.
>
> Does anyone have any ideas?
>
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec