MPIEXEC and Intel MPI library 1.0.1

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon May 2 13:38:09 EDT 2005


Hello Anton,

On Mon, May 02, 2005 at 06:06:52PM +0200, Anton Starikov wrote:
> Can you describe in detail:
> 1) Which fabric do you use?

the default TCP socket device on IA32 or EM64T nodes.

> 2) You wrote that the version is 1.0.1; does that actually mean version
> 1.0.035? So far it is the latest version available from Intel. I've tested
> my patch with this version.

the Package ID is l_mpi_pu_1.0.035; the release notes call it v1.0.1
(it's available on premier.intel since April 15)

probably we are talking about the same version

> The Finalize hang itself may be related to a general problem with MPICH2;
> it's described on the mpiexec site. Try adding "-kill" to the mpiexec options.

the -kill option does not help in my case (except that I only have
to kill one process manually to terminate the program); this
probably means that NO task tries to exit!?

It also might be related to the strange hostname output I see when
starting the program with mpiexec. (When I use the Intel-MPI
mpdboot/mpiexec/mpdallexit sequence, the output is correct and the
program terminates normally.)
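
A minimal sketch of how the mismatch can be checked, using only lines
copied from the output quoted below. It assumes that mpiexec task N
corresponds to MPI rank N (not confirmed here); the helper names are
made up for illustration.

```python
import re

# Lines copied from the "mpiexec -verbose" output quoted below.
MPIEXEC_LOG = """\
wait_one_task_start: evt = 2, task 0 host snode164
wait_one_task_start: evt = 3, task 1 host snode164
wait_one_task_start: evt = 4, task 2 host snode163
wait_one_task_start: evt = 5, task 3 host snode163
"""

# Lines copied from the test program's own output quoted below.
PROGRAM_OUTPUT = """\
Hello world: rank  0 of  4 running on snode164
Hello world: rank  1 of  4 running on snode164
Hello world: rank  2 of  4 running on snode164
Hello world: rank  3 of  4 running on snode164
"""


def hosts_from_mpiexec(text):
    """Map task number -> host on which mpiexec says it started the task."""
    pattern = re.compile(r"task (\d+) host (\S+)")
    return {int(m.group(1)): m.group(2) for m in pattern.finditer(text)}


def hosts_from_program(text):
    """Map rank -> host name printed by the test program for that rank."""
    pattern = re.compile(r"rank\s+(\d+) of\s+\d+ running on (\S+)")
    return {int(m.group(1)): m.group(2) for m in pattern.finditer(text)}


def mismatches(spawned, reported):
    """Return {rank: (host per mpiexec, host per program)} where they differ.

    Assumption: task numbers and ranks use the same numbering.
    """
    return {
        rank: (spawned[rank], reported[rank])
        for rank in sorted(spawned)
        if rank in reported and spawned[rank] != reported[rank]
    }


if __name__ == "__main__":
    diff = mismatches(hosts_from_mpiexec(MPIEXEC_LOG),
                      hosts_from_program(PROGRAM_OUTPUT))
    for rank, (spawned_on, reported_on) in diff.items():
        print(f"rank {rank}: started on {spawned_on}, reports {reported_on}")
```

Under that assumption, the tasks that mpiexec started on snode163 are
the ones whose reported host name disagrees.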

> Anton Starikov.

Regards,

thomas


> 
> Thomas Zeiser wrote:
> >Dear All!
> >
> >I just tested the patch to get Intel MPI running.  
> >
> >With version 1.0 of Intel MPI everything is fine. 
> >
> >However, when I try the recent update (Intel MPI 1.0.1) I get very
> >strange results:
> >
> >- I start the MPI program with
> >  mpiexec -comm pmi [-verbose] ./test-f-g77-intelmpi101
> >
> >- the processes are correctly started on all nodes (twice each on
> >  snode164 and snode163; verified with "ps")
> >
> >  mpiexec: resolve_exe: using absolute exe "./test-f-g77-intelmpi101".
> >  mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=0.
> >  mpiexec: accept_pmi_conn: rank 0 checks in.
> >  mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
> >  mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=1.
> >  mpiexec: accept_pmi_conn: rank 1 checks in.
> >  mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
> >  mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=2.
> >  mpiexec: accept_pmi_conn: rank 2 checks in.
> >  mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
> >  mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=3.
> >  mpiexec: accept_pmi_conn: rank 3 checks in.
> >  mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
> >  mpiexec: All 4 tasks started.
> >
> >- in the test program all MPI processes get their hostname using
> >  MPI_GET_PROCESSOR_NAME and send it with MPI_SEND to the master.
> >  The master receives the messages with MPI_RECV and outputs them
> >  (it's the simple test.f* program from the Intel MPI test
> >  directory). The output is the following:
> >
> >  Hello world: rank  0 of  4 running on snode164
> >  Hello world: rank  1 of  4 running on snode164
> >  Hello world: rank  2 of  4 running on snode164
> >  Hello world: rank  3 of  4 running on snode164
> >
> >  All processes seem to be running on the same node!
> >
> >- now MPI_FINALIZE comes in the program. However, the processes
> >  hang. When I now kill step by step all processes I get
> >
> >  accept_pmi_conn: waiting for info
> >  accept_pmi_conn: waiting for info
> >  accept_pmi_conn: waiting for info
> >  accept_pmi_conn: waiting for info
> >  wait_one_task_start: evt = 2, task 0 host snode164
> >  wait_one_task_start: evt = 3, task 1 host snode164
> >  wait_one_task_start: evt = 4, task 2 host snode163
> >  wait_one_task_start: evt = 5, task 3 host snode163
> >  wait_tasks: waiting for snode164 snode164 snode163 snode163
> >  wait_tasks: waiting for snode164 snode163 snode163
> >  wait_tasks: waiting for snode164 snode163
> >  wait_tasks: waiting for snode163
> >
> >  Killed
> >  ABORT - process 3: failure: Other MPI error
> >  mpiexec: wait_tasks: numspawned = 4, got evt 6 for tid 2 host snode164 
> >  status 1.
> >  mpiexec: wait_tasks: numspawned = 3, got evt 9 for tid 5 host snode163 
> >  status 13.
> >  Killed
> >  mpiexec: wait_tasks: numspawned = 2, got evt 7 for tid 3 host snode164 
> >  status 1.
> >  Killed
> >  mpiexec: wait_tasks: numspawned = 1, got evt 8 for tid 4 host snode163 
> >  status 1.
> >  mpiexec: Warning: tasks 0-2 exited with status 1.
> >  mpiexec: Warning: task 3 exited with status 13.
> >
> >
> >
> >
> >Any ideas (except using totalview to get a better insight)?
> >
> >
> >Kind regards,
> >
> >Thomas Zeiser

