unknown signal
Glen Beane
beaneg at umcs.maine.edu
Thu Oct 14 13:57:03 EDT 2004
Looks like the mom code does add 10,000 to the signal number:
    if (WIFEXITED(statloc))
        exiteval = WEXITSTATUS(statloc);
    else if (WIFSIGNALED(statloc))
        exiteval = WTERMSIG(statloc) + 10000;
    else
        exiteval = 1;
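
For what it's worth, here is a rough sketch of the arithmetic, assuming the mom really does hand back 10004 for a task killed by signal 4. The helper name `signal_from_exiteval` is made up for illustration; it is not something in mpiexec or PBS.

```c
#include <assert.h>

/* Offsets mentioned in this thread. */
#define LINUX_SIGNAL_OFFSET 0x100   /* from the mpiexec.c comment */
#define LOCAL_SIGNAL_OFFSET 10000   /* from the mom snippet above */

/*
 * Hypothetical helper: given the exiteval reported by the mom and the
 * offset that mom adds for signal deaths, return the signal number.
 * Returns 0 when exiteval is below the offset, i.e. a plain exit status.
 */
int signal_from_exiteval(int exiteval, int offset)
{
    assert(offset > 0);
    return exiteval >= offset ? exiteval - offset : 0;
}
```

With an exiteval of 10004, subtracting 0x100 yields 9748 (the "unknown signal" in the warning), while subtracting 10000 yields 4, which is SIGILL.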
On Oct 14, 2004, at 1:50 PM, Pete Wyckoff wrote:
> beaneg at umcs.maine.edu wrote on Thu, 14 Oct 2004 12:39 -0400:
>> mpiexec: Warning: task 0 died with signal 9748 (Unknown signal: 9748).
>>
>> Can anyone help me out here? What's going on? I'm pretty sure this
>> isn't my code, since it works fine up to 200+ processors. When I ran
>> on 240 processors the job ran OK for 22 minutes, then barfed with a
>> bunch of p4 errors and I got this error from mpiexec. The job should
>> have run for approx. 2 hours.
>
> PBS returns a status when the job exits. In the linux architecture-
> specific code and a few other arches too, there's a chunk that adds
> 256 to the signal number if it died from a signal, else just returns
> the exit value (0..255 by unix convention I think).
>
> Here's the comment from mpiexec.c:
>
> /*
>  * In PBS on linux and other arches, scan_for_terminated() interprets
>  * the return value from wait for us, whether we like it or not:
>  *
>  *     if (WIFEXITED(statloc))
>  *         exiteval = WEXITSTATUS(statloc);
>  *     else if (WIFSIGNALED(statloc))
>  *         exiteval = WTERMSIG(statloc) + 0x100;
>  *     else
>  *         exiteval = 1;
>  *
>  * The magic constant below fixes that.
>  */
>
> You might take a look at your mom code (mac/darwin/osx presumably) and
> try to figure out from where it gets such a massive number. Note that
> 9748 + 0x100 = 10004. Could be it adds 10000 to the signal (SIGILL in
> this case)? If you can figure out what's going on, let us know and I'll
> try to figure out how to hack it into the code a bit better.
>
> -- Pete
>