larger jobs fail regularly

Pete Wyckoff pw at osc.edu
Tue May 8 14:54:04 EDT 2007


thomas.zeiser at rrze.uni-erlangen.de wrote on Tue, 08 May 2007 20:31 +0200:
> On Tue, May 08, 2007 at 11:44:37AM -0400, Troy Baer wrote:
> > On Tue, 2007-05-08 at 17:38 +0200, Thomas Zeiser wrote:
> > > starting larger jobs (128 CPUs on 32 nodes) fails quite often on our
> > > new system with messages which seem to indicate a problem in the
> > > communication of mpiexec with torque.
> > > We are running SuSE SLES9 (x86_64), torque-2.1.6/2.1.8 and use mpiexec-0.82.
> > 
> > The error messages in your message show several PMI errors, so I suspect
> > there may be a bad interaction between mpiexec and your MPI library.
> > What MPI implementation are you using?
> 
> So far I mostly tested Intel-MPI 3.0-043 with an average success
> rate of only <20% if 128 CPUs are used.
> 
> I now also did some tests with mvapich2-0.9.8: 4 out of 5 runs succeeded,
> but the failing one gave very similar messages!
> 
> mpiexec: Warning: poll_or_block_event: evt 127 remote system error.
> [cli_8]: PMIU_parse_keyvals: unexpected key delimiter at character 1 in 0

Looks the same as your Intel results.  This first warning is from
mpiexec.  The rest of the "cli_" messages are from the surviving
tasks.  I suspect that things go bad after this warning and mpiexec
decides to start killing tasks off.  As that happens they complain
in their parsing routines while trying to read PMI messages,
generating all these "cli_" messages that are not the root of the
problem.

Here's a little patch to try to figure out which node is the reason
for the initial warning message.  It essentially converts that "evt
127" into a hostname so you can go look at the PBS mom log and see
if there are any hints.

Here is what this warning means.  The call tm_poll() returns two error
numbers.  The first is the return value of the function and handles
things like whether you passed in the right arguments etc.  The
second is an error value reported from whatever node generated the
event.  All events get forwarded to your local mother superior and
then down to mpiexec, hence the need for the second error value.
The error code TM_ESYSTEM seems to be the generic catchall for
"something bad happened" in PBS.  There's nothing the job launcher
can do to narrow it down further.
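
To make the two error channels concrete, here is a rough sketch in C.
It is not the mpiexec code and does not use the real TM API: every
name prefixed with demo_ or DEMO_ is a hypothetical stand-in, the
constant values are placeholders, and aborting on a local failure is
my assumption.  It only shows how a caller might keep the local return
value apart from the remotely reported error, and still look at the
event when the remote side reports the generic catchall.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder codes; the real ones come from the PBS tm.h. */
enum { DEMO_SUCCESS = 0, DEMO_EBADARGS = 1, DEMO_ESYSTEM = 2 };

enum demo_action {
    DEMO_ABORT,         /* local call failed, the event cannot be trusted */
    DEMO_WARN_AND_USE,  /* remote catchall error: warn, still handle event */
    DEMO_REMOTE_FATAL,  /* any other remote error: treat as fatal */
    DEMO_USE            /* no error on either channel */
};

/*
 * Hypothetical poll.  The return value is the local status (for example
 * bad arguments); *remote_err receives the status reported by whichever
 * node generated the event.
 */
int demo_poll(int simulate_remote_failure, int *remote_err)
{
    if (remote_err == NULL)
        return DEMO_EBADARGS;   /* caller mistake, reported locally */
    *remote_err = simulate_remote_failure ? DEMO_ESYSTEM : DEMO_SUCCESS;
    return DEMO_SUCCESS;
}

/* Decide what to do given both error values from a single poll. */
enum demo_action demo_classify(int local_err, int remote_err)
{
    if (local_err != DEMO_SUCCESS)
        return DEMO_ABORT;
    if (remote_err == DEMO_SUCCESS)
        return DEMO_USE;
    if (remote_err == DEMO_ESYSTEM)
        return DEMO_WARN_AND_USE;
    return DEMO_REMOTE_FATAL;
}

/* Poll once with valid arguments and classify the outcome. */
enum demo_action demo_poll_once(int simulate_remote_failure)
{
    int remote_err = DEMO_SUCCESS;
    int local_err = demo_poll(simulate_remote_failure, &remote_err);

    assert(local_err == DEMO_SUCCESS);  /* a valid pointer was passed */
    return demo_classify(local_err, remote_err);
}
```

With these stand-ins, demo_poll_once(1) yields DEMO_WARN_AND_USE: the
call itself worked, but a remote node reported the catchall error.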

Let me know if you find something in the mom log.  The one hope is
that we could try to track it down to something, say, timing or
network congestion related, and slow things down in mpiexec to try
to deal with it.  I really have no clue yet, though.

		-- Pete

Index: event.c
===================================================================
--- event.c	(revision 401)
+++ event.c	(working copy)
@@ -227,6 +227,7 @@ poll_or_block_event(int block)
     tm_event_t evt;
     int remote_tm_error;
     int err;
+    int remote_system_warning = 0;
 
     if (concurrent_master) {
       redo:
@@ -239,7 +240,7 @@ poll_or_block_event(int block)
 		    ;
 		else if (remote_tm_error == TM_ESYSTEM)
 		    /* issue warning, but look at event anyway */
-		    warning("%s: evt %d remote system error", __func__, evt);
+		    remote_system_warning = 1;
 		else
 		    error_tm_or_pbs(remote_tm_error,
 		      "%s: tm_poll remote %d", __func__, remote_tm_error);
@@ -259,6 +260,10 @@ poll_or_block_event(int block)
 		error("%s: no event structure for %d", __func__, evt);
 	}
 
+	if (remote_system_warning)
+	    warning("%s: evt %d task %d on %s: remote system error", __func__,
+	    	    evt, ep->task, nodes[tasks[ep->task].node].name);
+
 	/*
 	 * Check stdio listener.  Non-master equivalent of this code is
 	 * pushed down inside select() in concurrent_poll.

