unknown signal
Pete Wyckoff
pw at osc.edu
Thu Oct 14 13:50:09 EDT 2004
beaneg at umcs.maine.edu wrote on Thu, 14 Oct 2004 12:39 -0400:
> mpiexec: Warning: task 0 died with signal 9748 (Unknown signal: 9748).
>
> Can anyone help me out here? What's going on? I'm pretty sure this
> isn't my code, since it works fine up to 200+ processors. When I ran
> on 240 processors the job ran OK for 22 minutes, then barfed with a
> bunch of p4 errors and I got this error from mpiexec. The job should
> have run for approx. 2 hours.
PBS returns a status when the job exits. In the Linux
architecture-specific code, and in a few other arches too, there's a
chunk that adds 256 (0x100) to the signal number if the task died from
a signal, and otherwise just returns the exit value (0..255 by Unix
convention, I think).
Here's the comment from mpiexec.c:
    /*
     * In PBS on linux and other arches, scan_for_terminated() interprets
     * the return value from wait for us, whether we like it or not:
     *
     *     if (WIFEXITED(statloc))
     *         exiteval = WEXITSTATUS(statloc);
     *     else if (WIFSIGNALED(statloc))
     *         exiteval = WTERMSIG(statloc) + 0x100;
     *     else
     *         exiteval = 1;
     *
     * The magic constant below fixes that.
     */
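
To illustrate what undoing that encoding looks like, here is a minimal
sketch. It is not the actual mpiexec source: the helper name
decode_pbs_exit and the macro PBS_SIGNAL_OFFSET are made up for this
example, and it assumes the mom used the 0x100 convention quoted above.

```c
/* Offset the Linux PBS code adds to a signal number (per the comment above). */
#define PBS_SIGNAL_OFFSET 0x100

/*
 * Hypothetical helper, not taken from mpiexec: split a PBS exit value
 * into either an exit status or a signal number.
 * Returns 1 and stores the signal in *sig if the task died from a signal;
 * otherwise returns 0 and stores the exit status in *status.
 */
int decode_pbs_exit(int exiteval, int *status, int *sig)
{
    if (exiteval >= PBS_SIGNAL_OFFSET) {
        *sig = exiteval - PBS_SIGNAL_OFFSET;
        *status = 0;
        return 1;
    }
    *status = exiteval;
    *sig = 0;
    return 0;
}
```

With this convention a task killed by signal 9 arrives as 0x100 + 9 and
decodes back to 9. A mom that uses some other offset would decode to a
meaningless signal number.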
You might take a look at your mom code (Mac OS X/Darwin, presumably)
and try to figure out where it gets such a massive number. Note that
9748 + 0x100 = 10004, so that is presumably the raw value PBS handed
back. Perhaps it adds 10000 to the signal number, which would make this
signal 4 (SIGILL)? If you can figure out what's going on, let us know
and I'll try to figure out how to hack it into the code a bit better.
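
The arithmetic behind that guess can be checked with a short sketch.
Both function names are invented for illustration, and the 10000 offset
is only a hypothesis about the mom on that platform, not something
confirmed from the PBS source.

```c
/* Offset mpiexec removes, matching the Linux PBS code. */
#define LINUX_PBS_OFFSET 0x100

/* Recover the raw value PBS handed back from the signal number mpiexec printed. */
int raw_pbs_value(int reported_signal)
{
    return reported_signal + LINUX_PBS_OFFSET;
}

/*
 * Hypothesis test: suppose the mom adds `offset` to the signal number
 * instead of 0x100. Returns the implied signal, or -1 if the raw value
 * is smaller than the offset and so cannot have been encoded that way.
 */
int signal_under_offset(int raw, int offset)
{
    return raw >= offset ? raw - offset : -1;
}
```

Here raw_pbs_value(9748) gives 10004, and signal_under_offset(10004, 10000)
gives 4, which is SIGILL on both Linux and Darwin.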
-- Pete