one job only

Fokko Beekhof fpbeekhof at gmail.com
Fri Apr 14 17:34:33 EDT 2006


Hi,

> What do you mean second job?  Are these concurrent, i.e.

I mean that there is only one job running on the entire cluster.
No-one can successfully run a second job. Jobs can be submitted, jobs
will be scheduled and started, but only the first started job will
run. All others will fail with the message:
mpiexec: Error: poll_or_block_event: tm_poll remote 15010: System error.

It appears that one of the nodes is malfunctioning: node myri21 does
not automount NFS filesystems, including my home directory. Somehow
this node is allocated for each started job. Starting 2 extra jobs
simultaneously resulted in one dead job (the one that was allocated
node myri21), while the other ran just fine.
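
For anyone wanting to check for this kind of problem, here is a rough
sketch (not something I actually ran) of how one might find nodes that
cannot see a given directory. It assumes passwordless ssh to the compute
nodes and that touching the path is enough to trigger the automounter.
The helper names, the node names other than myri21, and the example
path are made up for illustration.

```python
import shlex
import subprocess


def node_can_see_path(node, path, run=subprocess.run, timeout=10):
    """Return True if `test -d path` succeeds on `node` via ssh.

    Assumption: passwordless ssh works to every node. `run` is
    injectable so the check can be exercised without a real cluster.
    """
    # ssh hands the remote command to a shell, so quote the path.
    cmd = ["ssh", "-o", "BatchMode=yes", node, "test", "-d", shlex.quote(path)]
    try:
        result = run(cmd, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung mount or an unreachable node counts as a failure.
        return False
    return result.returncode == 0


def find_bad_nodes(nodes, path, run=subprocess.run):
    """Return the nodes on which `path` is not an accessible directory."""
    return [n for n in nodes if not node_can_see_path(n, path, run=run)]


if __name__ == "__main__":
    # Hypothetical node list and home directory; adjust for the cluster.
    print(find_bad_nodes(["myri20", "myri21", "myri22"], "/home/someuser"))
```

A node reported by this check could then be taken offline in the batch
system so that it is no longer allocated to new jobs.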

> I wonder if the first job terminated cleanly.  Maybe you could run
> that one with "mpiexec -v -v " to see what happens.  And the same

This didn't show much. I'll just try to find someone with a root
password and have the offending node shut down until the sysadmin
returns from vacation :-)

Thanks and best regards,

Fokko Beekhof

> P.S.  Your gmail account sends html mail; might want to turn that
> off for mail to lists.

Sorry, it's my first post. It should be fixed in this one.

