tm-problems, smp-prob?

Pete Wyckoff pw at osc.edu
Tue Feb 25 15:07:14 EST 2003


stefan.friedel at iwr.uni-heidelberg.de said on Tue, 25 Feb 2003 16:34 +0100:
> Hi *,
> I've got 2 different questions: (Linux Cluster, running Openpbs 2.3.16, mpich-gm, Linux/Debian woody, 256x2 Nodes) -
> 
> - what means the following??:
> 
> #################
> Error received by batch job output Feb 24 23:31 trailing.queue.e1279
> 
> mpiexec: Error: wait_tasks: tm_poll remote: tm: system error.
> 
> Asynchron communication in mpicall never finished or died
> 
> DDD [000] ERROR 04200: receive-timeout for IF 14 in DDD_IFAExchange
> DDD [000] ERROR 04201:   waiting for message (from proc 1, size 2080)
> #################
> 
> Any hints?

I've never seen any of the above messages except for the one-line
mpiexec one, which unfortunately gives almost no information.  All that
is known is that PBS became "unhappy" while mpiexec was inside
tm_poll(), a PBS library call that does a blocking wait for the next
event that PBS will deliver.  PBS gives no useful information about why
it was unhappy.  You may have to wade through your pbs_mom logs and see
if something appears there.  Running out of some resource, a broken
disk, etc. may leave PBS unable to spawn or kill a task and lead to
that message, for example.
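
If it helps with the log wading, here is a rough Python sketch that
pulls every line mentioning one job out of a directory of pbs_mom log
files.  The function name find_job_lines is made up for this example,
and the log directory in the usage comment is only an assumption about
a common default; use whatever location your installation writes its
mom logs to.  The job number 1279 is taken from the error file name
above.

```python
from pathlib import Path


def find_job_lines(log_dir, job_id):
    """Return (file name, line) pairs from log_dir whose text mentions job_id."""
    hits = []
    # Visit the log files in name order, so that date-named files
    # are read chronologically.
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going past stray bytes in a log.
        with path.open(errors="replace") as handle:
            for line in handle:
                if job_id in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits


# Example use (the directory is an assumed default, adjust for your site):
#
#   for name, line in find_job_lines("/var/spool/pbs/mom_logs", "1279."):
#       print(name, line)
```

This is a plain substring match, so a short job number can also match
unrelated text; including the trailing dot of the full job id narrows
it down a bit.  It has to be run on each node that was part of the job.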

What software printed out the other lines:  "Error received ...",
"Asynchron ..." and "DDD ..."?  Which mpich-gm version are you using?

> - with one of our applications we have the problem that it only runs in a
> one node/one processor configuration with mpiexec (same system; the same
> PBS job with the mpirun.ch_gm/mpirun from Myri runs fine, e.g.
> nodes=16:ppn=2 or something).  We then get errors like:
> 
> ##################
> [3] Error: Unable to get GM local node id !
> [3] Error: write to socket failed !
> [2] Error: Unable to get GM local node id !
> [2] Error: write to socket failed !
> ##################
> 
> any hints here??

This is a GM error that usually means something is wrong with the
Myrinet card:  no board, routes not loaded, wrong map file, etc.  It
really should not matter whether you use mpiexec or mpirun.  Run
"gm_board_info" on the machines running MPI ranks 2 and 3 and see if it
gives you any information.  You can pass "-v" to mpiexec to see the
mapping between ranks and node names.
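
If there are more than a couple of nodes to check, something along
these lines could run the check for you.  It is only a sketch:  it
assumes passwordless ssh to the compute nodes and that gm_board_info is
on the remote PATH, and the rank-to-host table has to be filled in by
hand from what "mpiexec -v" prints.  The function name and the host
names in the usage comment are made up.

```python
import subprocess


def check_gm_boards(rank_to_host, ranks, run=subprocess.run):
    """Run gm_board_info over ssh on the hosts of the given MPI ranks.

    Returns {rank: (host, exit status, combined stdout and stderr)}.
    The run argument exists only so the ssh call can be replaced in tests.
    """
    report = {}
    for rank in ranks:
        host = rank_to_host[rank]
        # Assumes passwordless ssh and gm_board_info on the remote PATH.
        proc = run(["ssh", host, "gm_board_info"],
                   capture_output=True, text=True)
        report[rank] = (host, proc.returncode, proc.stdout + proc.stderr)
    return report


# Example use with hypothetical host names copied from "mpiexec -v":
#
#   mapping = {2: "node017", 3: "node018"}
#   for rank, (host, status, output) in check_gm_boards(mapping, [2, 3]).items():
#       print(rank, host, status)
#       print(output)
```

A non-zero exit status or empty output from a node is a reason to look
at that node's Myrinet card by hand.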

		-- Pete


