one job only
Pete Wyckoff
pw at osc.edu
Fri Apr 14 16:29:47 EDT 2006
fpbeekhof at gmail.com wrote on Fri, 14 Apr 2006 18:13 +0200:
> We seem to have a problem with mpiexec 0.80 + torque 1.2.0p1 +
> mpich-1.2.6..14b-gcc-4.1.0
> It is possible to start a job, but when trying to launch a second job, it
> will terminate immediately, and the stderr file contains:
>
> mpiexec: Error: poll_or_block_event: tm_poll remote 15010: System error.
The PBS mom was unhappy for some reason. You might check the mom
logs and see if anything sticks out.
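
A minimal sketch of that check, assuming a default Torque layout where the
mom writes one log file per day, named YYYYMMDD, under
/var/spool/torque/mom_logs. The directory, the helper name mom_log_hits and
the search pattern are all assumptions for illustration; adjust them to the
local PBS_HOME. Run it on the compute node where the job failed:

```shell
#!/bin/sh
# Assumed default location of the mom logs; override via the environment.
MOM_LOG_DIR=${MOM_LOG_DIR:-/var/spool/torque/mom_logs}

# Hypothetical helper: print suspicious lines from one day's mom log.
mom_log_hits() {
    # $1: day in YYYYMMDD form (defaults to today)
    day=${1:-$(date +%Y%m%d)}
    log="$MOM_LOG_DIR/$day"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # Case-insensitive match on a few words that usually mark trouble.
    grep -i -E 'error|fail|denied' "$log"
}

# Example: mom_log_hits 20060414
```

The logs are often readable only by root, so this may need to be run with
elevated privileges.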
> Strace says:
>
> 31782 bind(5, {sa_family=AF_INET, sin_port=htons(1023), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
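It's rather hard to use strace because of the setuid binary needed
by PBS to authenticate itself. A setuid program traced by an
ordinary user runs without its elevated privileges, so the bind to
a reserved port fails with EACCES under strace even when it works
normally. You can grep for "strace" in mpiexec.c and change that
section to "#if 1" if you want to strace. But I don't think it will
show much.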
> However, this problem does not occur with the first job. Also,
> /usr/sbin/pbs_iff -t myri0 15001
> doesn't generate any output that indicates error.
What do you mean second job? Are these concurrent, i.e.
    mpiexec -n 1 --comm=none sleep 2400 &
    mpiexec -n 1 hello
Or one after the other (without the "&" above)?
I wonder if the first job terminated cleanly. Maybe you could run
that one with "mpiexec -v -v" to see what happens. And the same
on the second one. And my usual advice for debugging is to turn off
as much as possible. If it works without "-nostdin -nostdout" that
will tell us a bit more.
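
One way to capture those verbose runs for comparison, as a rough sketch: the
wrapper name run_verbose and the log file names are made up for
illustration, and "hello" is the test program from the example above. Only
the "-v -v" and "-n" options mentioned in this thread are used.

```shell
#!/bin/sh
# MPIEXEC can be overridden; by default use whatever mpiexec is in PATH.
MPIEXEC=${MPIEXEC:-mpiexec}

# Hypothetical wrapper: run mpiexec with "-v -v" and keep all output.
run_verbose() {
    # $1: log file; remaining arguments are passed on to mpiexec
    log=$1
    shift
    "$MPIEXEC" -v -v "$@" > "$log" 2>&1
    status=$?
    echo "exit status $status, output in $log"
    return $status
}

# Example, one job after the other:
#   run_verbose first.log  -n 1 hello
#   run_verbose second.log -n 1 hello
```

Comparing the two logs should show whether the first job shut down cleanly
and at which step the second one fails.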
-- Pete
P.S. Your gmail account sends html mail; might want to turn that
off for mail to lists.