tm_obit

Pete Wyckoff pw at osc.edu
Mon Jun 30 09:55:21 EDT 2003


curt at princeton.edu said on Tue, 24 Jun 2003 17:37 -0400:
> I've installed mpiexec on our 32 node x 2 cpu/node cluster with the
> following software configuration:
> 
> OpenPBS-2.3.16 (patches including mpiexec)
> mpich-1.2.5 (--with-comm=shared among other things)
> mpiexec-0.74 (--with-default-comm=mpich-p4)
> icc-7.1
> RedHat 7.3
> 
> When I execute runtest.pl I get errors like the following in many of the
> output files (for jobs using multiple nodes):
> 
> testho.10466.63:mpiexec: Error: start_tasks: tm_obit 3: tm: not found.
> 
> The output looks fine - everything expected is there, and there is no
> extra output other than this error message from the task manager.  The
> task reporting an error is usually 3, but occasionally 1 or 2.  I'm
> running a vanilla runtest.pl, so available_nodes=4 and smpsize=2.
> 
> Any ideas what I'm doing wrong here?

Maybe it's not you.  The API for TM is a bit odd.  We tell it to spawn
a job, then we tell it to tell us if the job dies.  There is a bit of a
race condition in there.  The error "not found" means that TM does not
know about the task for which we demanded an obit.  My guess is that the
process runs so quickly that it exits, apparently fine, before the obit
request is issued.
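
To make the suspected race concrete, here is a toy model in C.  It is
not the PBS TM API: the toy_* names, the result codes, and the task table
are all made up for illustration, and it assumes (as guessed above) that
the manager forgets a task as soon as it exits.  The last function
sketches one possible way a launcher could treat "not found" as "already
finished" instead of as an error.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_TASKS 8

/* Hypothetical result codes for this toy model; not the real TM constants. */
enum { TOY_OK = 0, TOY_NOT_FOUND = 1 };

/* Toy task table: live[i] is nonzero while task i is known to the manager. */
struct toy_tm {
    int live[TOY_MAX_TASKS];
};

/* Start with no known tasks. */
void toy_init(struct toy_tm *tm)
{
    memset(tm->live, 0, sizeof tm->live);
}

/* Step 1 of the launcher: ask the manager to start a task. */
int toy_spawn(struct toy_tm *tm, int id)
{
    assert(id >= 0 && id < TOY_MAX_TASKS);
    tm->live[id] = 1;
    return TOY_OK;
}

/* The task finishes on its own; in this model the manager then forgets it. */
void toy_task_exit(struct toy_tm *tm, int id)
{
    assert(id >= 0 && id < TOY_MAX_TASKS);
    tm->live[id] = 0;
}

/* Step 2 of the launcher: ask to be told when the task dies.
 * If the task is already gone, the manager no longer knows it. */
int toy_obit(const struct toy_tm *tm, int id)
{
    assert(id >= 0 && id < TOY_MAX_TASKS);
    return tm->live[id] ? TOY_OK : TOY_NOT_FOUND;
}

/* Tolerant launcher policy: a task that is "not found" at obit time is
 * assumed to have exited already, so it is not reported as an error. */
int launcher_watch(const struct toy_tm *tm, int id)
{
    int rc = toy_obit(tm, id);
    if (rc == TOY_NOT_FOUND)
        return 0;   /* assumed: task ran to completion before we asked */
    return rc;
}
```

In this model, spawning task 3, letting it exit, and only then calling
toy_obit(&tm, 3) gives TOY_NOT_FOUND, while launcher_watch(&tm, 3)
returns 0.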

If you think this could be the case, let me know and I'll stop treating
it as an error.  If you're interested, you could test this by adding a
sleep() somewhere in the parallel code "hello.c" and seeing whether the
error goes away.
Do you suspect your machines are very fast, or are you testing on large
numbers of nodes?  It would be nice to hear that real parallel codes
work okay on your system, and that this is just an artifact of the
"hello" test code.

Thanks,

		-- Pete


